Identification of DNA-binding proteins using support vector machines and evolutionary profiles
© Kumar et al. 2007
Received: 04 June 2007
Accepted: 27 November 2007
Published: 27 November 2007
Identification of DNA-binding proteins is one of the major challenges in the field of genome annotation, as these proteins play a crucial role in gene regulation. In this paper, we developed various SVM modules for predicting DNA-binding domains and proteins. All models were trained and tested on multiple datasets of non-redundant proteins.
SVM models were developed on DNAaset, which consists of 1153 DNA-binding proteins and an equal number of non-DNA-binding proteins, and achieved maximum accuracies of 72.42% and 71.59% using amino acid and dipeptide compositions, respectively. The accuracy improved from 72.42% to 74.22% when evolutionary information in the form of PSSM profiles was used as input instead of amino acid composition. In addition, SVM models were developed on DNAset, which consists of 146 DNA-binding and 250 non-binding chains/domains, and achieved maximum accuracies of 79.80% and 86.62% using amino acid composition and PSSM profiles, respectively. The SVM models developed in this study perform better than existing methods on a blind dataset.
A highly accurate method has been developed for predicting DNA-binding proteins using SVM and PSSM profiles. This is the first study in which evolutionary information in the form of PSSM profiles has been used successfully for predicting DNA-binding proteins. A web server, DNAbinder, has been developed for identifying DNA-binding proteins and domains from query amino acid sequences (http://www.imtech.res.in/raghava/dnabinder/).
DNA-binding proteins (DNA-BPs) are very important constituents of both eukaryotic and prokaryotic proteomes. It has been reported that approximately 2–3% of prokaryotic and 6–7% of eukaryotic proteins can bind to DNA [1, 2]. These proteins play important roles in DNA packaging, replication, transcription regulation and other activities associated with DNA. Hence, proteins that target specific DNA sequences are potential therapeutics for genetic diseases and cancers. In the form of restriction enzymes, DNA-BPs play a crucial role in prokaryotic host defence. Because of their different functions, DNA-BPs are a diverse group of proteins in terms of both amino acid sequence and three-dimensional structure. Hence, identification of DNA-BPs can play a vital role in proteome annotation and in understanding an important class of proteins.
In the past, several methods have been developed for predicting DNA-BPs. Broadly, these methods can be divided into two categories: i) prediction from protein structure and ii) prediction from amino acid sequence. Structure-based prediction methods discriminate DNA-binding and non-binding proteins with high accuracy on the basis of positively charged electrostatic patches, DNA-binding structural motifs, protein sequence composition, solvent accessibility and secondary structure, net charge, dipole and quadrupole moments, and the size of the largest positive surface patch and amino acid composition. Unfortunately, these methods cannot be used in high-throughput annotation, as they require the structure of a protein for prediction. Cai and Lin (2003) used pseudo-amino acid composition as input for a support vector machine (SVM) to discriminate RNA, rRNA and DNA binding proteins from non-binding proteins (NBPs). More recently, Yu et al. (2006) developed an SVM-based method for prediction of rRNA, RNA and DNA binding proteins. They used a feature vector of dimension 132, which includes amino acid composition and composition of physico-chemical properties.
In this work, a systematic attempt has been made to predict DNA-BPs from their amino acid sequences using various features of proteins, such as amino acid composition. First, we analyzed the amino acid composition of DNA-binding proteins and, based on these observations, developed SVM models using amino acid, dipeptide and four-part amino acid compositions of proteins. Besides composition, we also developed SVM models using PSSM profiles obtained from PSI-BLAST. We also examined the performance of similarity search (BLAST and PSI-BLAST) and motif-finding (MEME/MAST) approaches. All models developed in this study were evaluated using the five-fold cross-validation technique.
The performance of SVM models developed using different types of compositions. These models were trained and tested on DNAset, a dataset of DNA-binding and non-binding protein domains/chains.
The performance of SVM models developed on DNAaset and evaluated using the five-fold cross-validation technique.
The performance of PSSM based SVM models on proteins with and without PSI-BLAST hits at e-value 0.1 against DNAset. These models were trained and tested on DNAset.
It is important to examine the performance of a newly developed model on an independent dataset. In this study, we evaluated the performance of our SVM models (trained on DNAset) on an independent dataset called DNAiset, which consists of 100 NBPs (BindN_testsp) and 92 DNA-BPs (BindN_testpdb) obtained from Wang and Brown (2006). At the default threshold of 0.1, our PSSM based SVM model correctly predicted 70 out of 92 DNA-BPs and 89 out of 100 NBPs (see Additional file 1, Table S3). This demonstrates that our SVM model performs equally well on an independent dataset.
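The counts reported above can be turned into the usual performance measures. The following is a minimal sketch, assuming the standard confusion-matrix definitions of sensitivity, specificity, accuracy and MCC; the function name `binary_metrics` is hypothetical and is not taken from the DNAbinder software.

```python
import math

def binary_metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy, MCC) from confusion counts."""
    sensitivity = tp / (tp + fn)          # fraction of DNA-BPs recovered
    specificity = tn / (tn + fp)          # fraction of NBPs recovered
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sensitivity, specificity, accuracy, mcc

# Counts stated in the text for DNAiset:
# 70 of 92 DNA-BPs and 89 of 100 NBPs were predicted correctly.
sens, spec, acc, mcc = binary_metrics(tp=70, fn=22, tn=89, fp=11)
print(f"sensitivity={sens:.4f} specificity={spec:.4f} "
      f"accuracy={acc:.4f} MCC={mcc:.4f}")
```

With these counts the accuracy is 159/192, about 82.8%, with a sensitivity of about 76.1% and a specificity of 89%.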
The performance of SVM models using amino acid composition and PSSM profiles on a realistic dataset (DNArset).
It is important to compare the performance of a newly developed method with existing methods in order to demonstrate its capabilities. We compared the performance of the above SVM models with similarity and motif-based approaches, which are commonly used for functional annotation of proteomes. As shown in Additional file 1, Table S4, the sensitivity of both modules, DNA-BLAST and DNA-PSIBLAST, was about 10% at an e-value of 0.1 when evaluated on DNAset using the five-fold cross-validation technique. We also evaluated the performance of MEME/MAST on DNAset and observed poor performance: only 15 domains showed a motif at an e-value cut-off of 1. These results demonstrate that our SVM models perform better than commonly used techniques such as BLAST and MEME/MAST.
Most of the existing methods predict DNA-binding proteins from protein structures. These structure-based prediction methods are not suitable for high-throughput genome annotation, as they require the structure of a protein. To the best of the authors' knowledge, three methods have been developed in the past for predicting DNA-BPs from their amino acid sequences. Ahmad et al. (2004) developed a neural network based method using amino acid composition and reported an accuracy of 64.5%. They used a small but clean and non-redundant dataset, which consists of 62 DNA-BPs obtained from PDB and 915 non-binding proteins obtained from Swiss-Prot. Their dataset was similar to our main dataset (DNAset), on which we achieved a maximum accuracy of 79.80% using amino acid composition and 86.62% using PSSM profiles. Cai and Lin (2003) developed an SVM-based method using pseudo-amino acid composition and obtained an average accuracy of 81%. Although they compiled a large dataset, it was neither non-redundant nor clean because it contained all proteins, including probable DNA-BPs. Yu et al. (2006) developed an SVM-based method using various physico-chemical properties and showed an average accuracy of 71.64%. They collected binding and non-binding proteins from Swiss-Prot as proposed by Cai and Lin (2003) and created a non-redundant dataset of 1153 DNA-BPs and 1153 non-binding proteins. On the alternate dataset (DNAaset), which was identical to that of Yu et al. (2006), we achieved an accuracy of 74.22% using the PSSM based SVM model. This demonstrates that our method performs better than other existing methods. Two factors improved the performance of the present method: the use of SVM, which performs better than ANN, particularly on small datasets, and the use of evolutionary information in the form of PSSM profiles.
In this study, we developed modules for predicting DNA-binding proteins using two types of datasets: i) DNAset, which consists of partial sequences (binding regions) or DNA-binding domains, and ii) DNAaset, which consists of full-length DNA-binding proteins. The question arises whether modules trained on DNAset (domains or partial sequences) are applicable for predicting full DNA-binding proteins, or vice versa. We predicted the proteins in DNAaset using the amino acid composition based SVM module trained on DNAset and achieved an accuracy of about 55%. Similarly, we predicted the proteins in DNAset using the amino acid composition based SVM module trained on DNAaset and achieved an accuracy of about 63%. This revealed that a model trained on partial sequences or domains is not valid for predicting full DNA-binding proteins, and vice versa. In order to understand the reason for this failure, we computed and compared the amino acid compositions of DNA-binding and non-binding proteins in DNAaset (Figure S3). The DNA-binding and non-DNA-binding proteins have significantly different amino acid compositions in DNAset, whereas no such trend is observed in DNAaset (Figure 1 and Figure S3). This is because DNAaset contains full-length DNA-binding proteins, which may also have other domains, including non-DNA-binding domains. Thus, methods trained and tested on DNAset are more accurate than those trained and tested on DNAaset. It also explains the failure of methods trained on full proteins (DNAaset) and tested on the domain dataset (DNAset), and vice versa. This shows that separate methods are necessary for predicting DNA-binding domains and DNA-binding proteins.
It is also possible that our models were over-trained. To rule out this possibility, we evaluated the performance of other existing methods. We found two publicly available methods: (i) DBS-PRED, developed on DNA-binding domains, and (ii) SVM-Prot, developed on full DNA-binding proteins. The performance of DBS-PRED was evaluated on a dataset of 100 DNA-binding and 100 non-binding proteins extracted from Swiss-Prot. As shown in Additional file 1, Table S5, it achieved a maximum performance of 63% at a 30% probability threshold. This means that the performance of DBS-PRED is also poor on full proteins. Similarly, the performance of SVM-Prot was evaluated on DNAset. SVM-Prot predicted 49 out of 146 DNA-binding proteins and 203 out of 250 non-binding proteins when all hits were considered for evaluation; when only the top hit was considered, SVM-Prot was able to identify 13 out of 146 sequences (DNA-binding domains/regions) as DNA-binding proteins. This analysis showed that SVM-Prot, developed on full proteins, is not suitable for partial or domain sequences.
The prediction method described in this paper is implemented in the form of a web server, DNAbinder (for a detailed description, please see the section Availability and Requirements). The common gateway interface script of DNAbinder is written in PERL version 5.03. The DNAbinder server is installed on a Sun Server (420E) under a UNIX (Solaris 7) environment. This server allows users to predict DNA-binding proteins using amino acid composition and PSSM based SVM models trained on DNAset, DNArset and DNAaset. Models trained on DNAset or DNArset are suitable for predicting DNA-binding domains/chains, whereas models trained on DNAaset are suitable for predicting full-length DNA-binding proteins. The server allows submission of multiple sequences for prediction with the composition based model and submission of one sequence at a time with the PSSM based model. The PSSM based model is used if PSI-BLAST finds significant hits for the query sequence; otherwise only the simple composition based model is used for prediction.
DNA-binding proteins are one of the major classes of proteins playing a central role in cellular metabolism. Due to their importance in the regulation of gene expression and other processes, several methods have been developed for predicting DNA-BPs. Most of them predict DNA-BPs using structural information. These methods have limited scope because the structures of most proteins are unknown. In this paper, we developed a highly accurate method for predicting DNA-BPs from their amino acid sequences. The reliability of any prediction method depends mainly on a clean and valid dataset. In the past, two types of datasets have been used: (i) a small and clean dataset with experimentally validated DNA-BPs obtained from PDB and (ii) a large dataset, where sequences were obtained from Swiss-Prot [7, 8]. In this study, we developed our models on two datasets, called DNAset (main dataset) and DNAaset (alternate dataset), in order to benchmark our newly developed method against existing methods. Most of our study was based on DNAset because it has clean and experimentally validated DNA-BPs, whereas DNAaset may also include fragments and putative proteins.
In any functional annotation, the primary step is to search a query protein against a database of annotated proteins (e.g. Swiss-Prot) and assign the function if the query protein has significant similarity with target proteins. Similarity based annotation is highly accurate if an experimentally annotated homologous protein is found. The major challenge, however, is to predict the function of a protein in the absence of significant similarity. Thus, we developed a method using a non-redundant dataset, where similarity between proteins in the test and training datasets is very low. We applied similarity search and motif finding techniques on our dataset and found poor performance, as expected. Hence, we developed ANN and SVM models on DNAset using amino acid and dipeptide compositions for predicting DNA-BPs. We achieved an accuracy of 68.46% with the ANN model using amino acid composition, which is slightly better than that of Ahmad et al. (2004) (see Additional file 1, Table S6). This may be due to the larger number of DNA-BPs in DNAset. It is interesting to note that the performance of SVM was significantly better than that of ANN. Thus, we developed the rest of the models using SVM. We achieved a high accuracy of 79.80% using the SVM model based on simple amino acid composition. This demonstrates the importance of machine learning techniques in the prediction of DNA-BPs. In order to enhance the performance further, SVM models were developed using dipeptide and four-part compositions. Unexpectedly, the dipeptide and four-part composition based SVM models performed worse than the amino acid composition based SVM model. Standard techniques such as BLAST, PSI-BLAST and MEME/MAST failed to predict DNA-BPs when tested on DNAset because of the low similarity among domains/chains in DNAset.
This is the first study that used evolutionary information to discriminate DNA-BPs from non-binding proteins. We extracted the evolutionary information of a protein from PSSM profiles obtained from a PSI-BLAST search against "nr". The accuracy of the method improved significantly, from 79.80% to 86.62%, using PSSM profiles when evaluated on DNAset. The quality of the PSSM profile of a query sequence depends on the similar sequences available in the target database. As shown in Table 3, the performance of the SVM based model decreases significantly when the PSSM is generated from DNAset instead of the "nr" database. This is because "nr" is a very large database in comparison to DNAset, and hence the chances of finding similar sequences are very high when a sequence is searched against "nr". Thus, the quality of PSSM profiles will be superior in the case of "nr". We also examined PSI-BLAST hits and observed that each protein in DNAset has BLAST hits when searched against the "nr" database. We also demonstrate that PSSM based models perform better if there are significant BLAST hits (Table 3). As the "nr" database is growing exponentially due to the number of sequencing projects, the quality of PSSM profiles will improve over time, which will eventually improve the performance of PSSM based models. In the DNAbinder server, we first examine whether a query sequence has any BLAST hit in "nr"; if yes, the PSSM based model is used, and otherwise the amino acid composition based SVM model is used for prediction.
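The fallback rule described for the server can be sketched as a small dispatch function. This is only an illustration under stated assumptions: the callables `build_pssm_features`, `pssm_model` and `composition_model` are hypothetical placeholders (not part of DNAbinder or SVM_light), and a missing PSI-BLAST hit is assumed to be signalled by `None`.

```python
def predict_dna_binding(sequence, build_pssm_features, pssm_model, composition_model):
    """Use the PSSM model when a profile is available, else the composition model.

    build_pssm_features: hypothetical callable returning a feature list,
        or None when PSI-BLAST finds no hit for the query.
    pssm_model / composition_model: hypothetical callables returning a score.
    """
    features = build_pssm_features(sequence)
    if features is not None:
        return "pssm", pssm_model(features)
    return "composition", composition_model(sequence)

# Stand-in stubs so the sketch runs without PSI-BLAST or a trained SVM.
def fake_pssm_features(sequence):
    # Pretend that only sequences of at least 10 residues obtain a hit.
    return [0.5, 0.5, 0.5] if len(sequence) >= 10 else None

route_long = predict_dna_binding("MKRAVLSTPQ", fake_pssm_features,
                                 lambda f: 1.2, lambda s: -0.3)
route_short = predict_dna_binding("MKR", fake_pssm_features,
                                  lambda f: 1.2, lambda s: -0.3)
print(route_long, route_short)
```

Passing the models in as arguments keeps the routing logic separate from how the profile is built, which is convenient for testing the rule in isolation.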
It is well known that similarity/motif based methods perform better than ab initio classifiers if a query protein has significant similarity with target proteins. The major challenge is to develop a method that can identify a novel DNA-binding protein even if it has no sequence similarity with any of the known DNA-BPs. Hence, we used non-redundant proteins in our main dataset. If we had used redundant proteins, the similarity based methods would have performed better than other methods. Our SVM model based on amino acid composition performed well on a dataset where similarity based methods failed to detect DNA-BPs. This demonstrates that the models developed in this study are capable of identifying novel DNA-BPs. The SVM model based on PSSM further improved the accuracy by 6–7%. The question arises why we generate the PSSM from the "nr" protein database instead of our own dataset, whereas we assess BLAST/PSI-BLAST and MEME/MAST on DNAset. There is a fundamental difference between BLAST/PSI-BLAST searching and an SVM model using PSSM profiles. In the case of BLAST/PSI-BLAST, one needs to search only against well-annotated proteins, because the function of the target protein is assigned to the query protein if the two are similar. In the case of PSSM, it is not important to search a query sequence against annotated proteins only, because we do not assign function based on similarity. For generating PSSM profiles, it does not matter whether the query sequence has similarity with DNA-BPs, NBPs or un-annotated proteins. Thus, one may create a PSSM from any database such as "nr", in contrast to similarity based methods, where we need to search against well-annotated proteins only.
The main objective of this study is to develop a prediction method for the identification of new DNA-binding proteins, particularly for annotating newly sequenced genomes. Most of the existing methods have been trained on experimentally annotated DNA-binding protein chains/domains obtained from PDB. Models developed on a dataset of protein chains such as DNAset are suitable for predicting DNA-binding domains but not for predicting DNA-BPs. In order to develop a method for predicting DNA-binding proteins, we also developed an SVM model using DNAaset, which consists of full-length DNA-binding and non-binding proteins. Thus, models developed on DNAaset will be suitable for predicting DNA-BPs. The performance of our SVM model was slightly better than that of the SVM model of Yu et al. (2006) using amino acid composition on the same dataset. The difference in performance may be due to the optimization of learning parameters, as Yu et al. (2006) used the default parameters of SVM_light whereas we used optimized parameters. Our SVM model using PSSM profiles performs better, by 3%, than the SVM model of Yu et al. (2006) based on physicochemical properties. These results suggest that evolutionary information is important for predicting DNA-binding proteins.
We developed a highly accurate method for predicting DNA-BPs using the machine learning technique SVM. For the first time, evolutionary information has been used to predict DNA-binding proteins. PSSM based models perform better than the other models by 3–7% on all the datasets, including the independent and realistic datasets. The SVM models developed in this study perform better than other existing methods. One of the major features of this study is a publicly available web server and stand-alone software, which allow users to identify the DNA-BPs in their own datasets of proteins. We observed that models trained on DNA-binding domains or partial sequences are not suitable for predicting DNA-binding proteins, and vice versa. Our server, DNAbinder, allows users to identify DNA-binding domains using the model trained on DNAset and to predict DNA-binding proteins using the model trained on DNAaset. We hope this study will assist biologists in the annotation of genomes.
In this study, we used the following datasets to develop various models for predicting DNA-BPs and for evaluating SVM models.
We extracted 2435 DNA-BPs from the Protein Data Bank (PDB) using the keywords "Protein-DNA complex", "DNA binding" and "DNA binding proteins". All proteins having no DNA chain or having high similarity with other proteins were filtered out. Finally, we obtained 146 non-redundant DNA-BPs in which no two proteins have a sequence identity of more than 25%. A non-redundant set of 250 non-binding proteins was obtained from Stawiski et al. (2003). They used the following criteria: (i) no two protein chains have more than 25% similarity and (ii) the approximate size and electrostatics are similar to those of DNA-BPs. The final dataset, called DNAset (also the main dataset or domain dataset), consists of 146 DNA-binding and 250 non-binding protein chains or domains. For convenience, we refer to protein chains as domains in order to distinguish these PDB chains from the full-length DNA-binding proteins obtained from Swiss-Prot.
In addition to the main dataset (DNAset), we also created an alternate dataset called DNAaset. This dataset consists of 1153 DNA-BPs and 1153 NBPs extracted from Yu et al. (2006). The parent dataset has 88 rRNA-BPs, 377 RNA-BPs, 1153 DNA-BPs and 17779 non-binding proteins. We randomly picked 1153 NBPs and took all 1153 DNA-BPs to constitute the alternate dataset (DNAaset). This is a non-redundant dataset in which no two proteins have more than 25% similarity.
In order to evaluate the performance of our models on a dataset not used for training or testing, we created an independent dataset called DNAiset. This dataset has 92 DNA-binding protein chains obtained from PDB and 100 non-binding proteins picked from Swiss-Prot. These proteins were obtained from the BindN server.
An equal number of negative and positive examples is important for developing an efficient classifier. It is also important for evaluating any prediction model, because one can then simply calculate accuracy to measure performance. All the above datasets have nearly equal numbers of DNA-binding and non-binding proteins. However, in real life DNA-BPs are far fewer than non-binding proteins. This raises the question of whether models developed on equal numbers will be effective in real life. Thus, we created a more realistic dataset called DNArset. This dataset has the 146 DNA-BPs used in DNAset and 1500 NBPs. These 1500 NBPs were extracted from the 17779 non-binding proteins used by Yu et al. (2006) after removing the proteins whose DNA-binding property is not experimentally validated.
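The reason accuracy alone is not enough on an imbalanced set such as DNArset can be seen with a small calculation. The sketch below is an illustration only, not a result from the paper: it considers a hypothetical trivial classifier that labels every protein as non-binding.

```python
# Class sizes of DNArset as stated in the text.
n_binding, n_nonbinding = 146, 1500

# Hypothetical trivial classifier: every protein is labelled "non-binding".
tp, fn = 0, n_binding        # no DNA-BP is ever recovered
tn, fp = n_nonbinding, 0     # every NBP is trivially correct

accuracy = (tp + tn) / (n_binding + n_nonbinding)
sensitivity = tp / (tp + fn)
print(f"accuracy={accuracy:.4f} sensitivity={sensitivity:.4f}")
```

Such a classifier reaches an accuracy of 1500/1646, about 91.1%, while detecting no DNA-BPs at all, which is why sensitivity, specificity and MCC are needed alongside accuracy on this kind of dataset.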
We adopted a five-fold cross-validation approach to evaluate the performance of all models developed in this study. In this procedure, the whole dataset is randomly divided into five equal parts. Four sets are used for training and the remaining one for testing. The procedure is repeated five times in such a way that each set is tested once. This type of sub-sampling test (e.g., 5- or 7-fold cross-validation) is often used to validate the prediction performance of statistical methods. On the other hand, the jack-knife test is deemed the most rigorous and objective, as analyzed by a comprehensive review, and has been increasingly adopted by investigators to test the power of various prediction methods (see, e.g., [20–28]). In order to assess the performance of a model, we computed different parameters: sensitivity, specificity, accuracy and Matthews correlation coefficient (MCC).
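The partitioning step of five-fold cross-validation can be sketched as follows. This is a minimal illustration, assuming a plain random (non-stratified) split with a fixed seed; the paper does not state how the random division was implemented, and the function name `five_fold_splits` is hypothetical.

```python
import random

def five_fold_splits(n_items, n_folds=5, seed=0):
    """Yield (train_indices, test_indices); each item is tested exactly once."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)            # random division of the dataset
    folds = [indices[i::n_folds] for i in range(n_folds)]  # near-equal parts
    for k in range(n_folds):
        test = folds[k]
        train = [idx for j, fold in enumerate(folds) if j != k for idx in fold]
        yield train, test

# 396 = 146 DNA-binding + 250 non-binding chains in DNAset.
splits = list(five_fold_splits(396))
for train, test in splits:
    print(len(train), len(test))
```

Because 396 is not divisible by five, the parts differ in size by at most one item; the sensitivity, specificity, accuracy and MCC would then be computed on each test part and averaged.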
In order to assess the performance of similarity search approaches, we evaluated the performance of BLAST and PSI-BLAST on DNAset. We searched the proteins of the test set against the training set proteins using BLAST/PSI-BLAST and assigned a query protein as DNA-binding or non-binding if the first hit was a DNA-BP or an NBP, respectively. We assigned a protein as "unknown" if it had no significant similarity with any target protein.
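The assignment rule above can be sketched on already-parsed search results. This is an illustration under stated assumptions: hits are given as `(target_id, e_value)` pairs, the "first hit" is taken to be the one with the lowest e-value, and the cut-off of 0.1 mirrors the e-value quoted elsewhere in the text. The function name `assign_by_top_hit` is hypothetical, and no BLAST call is made here.

```python
def assign_by_top_hit(hits, labels, evalue_cutoff=0.1):
    """Label a query from its best significant hit, or "unknown" if none.

    hits:   list of (target_id, e_value) pairs parsed from a search report
    labels: dict mapping target_id to "DNA-BP" or "NBP" (training annotations)
    """
    significant = [hit for hit in hits if hit[1] <= evalue_cutoff]
    if not significant:
        return "unknown"
    best_target, _ = min(significant, key=lambda hit: hit[1])
    return labels[best_target]

labels = {"t1": "DNA-BP", "t2": "NBP"}
print(assign_by_top_hit([("t2", 0.05), ("t1", 1e-5)], labels))
print(assign_by_top_hit([("t2", 5.0)], labels))
```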
In this study, SVM_light, a freely available software package, was used to implement SVM. The SVM is a supervised machine-learning method based on the structural risk minimization principle from statistical learning theory. It takes a set of feature vectors as input, along with their known outputs, which are used to train the model. After training, the learned model can be used for prediction of unknown examples [30, 31]. A detailed description of SVM can be found in Vapnik (1995). In this work, SVM training was carried out by optimizing various kernel function parameters and the value of the regularization parameter C.
\[ \mathrm{dpep}(i) = \frac{D_i}{N} \]
where dpep(i) is the fraction (composition) of dipeptide type i, D_i is the number of dipeptides of type i in the protein, and N is the total number of overlapping dipeptides in the protein.
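The dipeptide composition defined here can be sketched as a 400-dimensional vector (20 x 20 ordered pairs of standard residues). This is a minimal sketch, not the authors' implementation: the residue ordering, the function name `dipeptide_composition` and the choice to skip pairs containing non-standard residues are assumptions.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # assumed ordering of the 20 residues
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]

def dipeptide_composition(sequence):
    """Return the fraction of each of the 400 overlapping dipeptide types."""
    sequence = sequence.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for i in range(len(sequence) - 1):      # overlapping windows of length 2
        pair = sequence[i:i + 2]
        if pair in counts:                  # assumption: skip non-standard residues
            counts[pair] += 1
            total += 1
    if total == 0:
        return [0.0] * len(DIPEPTIDES)
    return [counts[d] / total for d in DIPEPTIDES]

vec = dipeptide_composition("ACAC")         # overlapping pairs: AC, CA, AC
print(len(vec), vec[DIPEPTIDES.index("AC")], vec[DIPEPTIDES.index("CA")])
```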
Protein sequence was divided into four non-overlapping equal length sub-sequences. Then amino acid composition of each sub-sequence was computed. Composition of each sub-sequence was concatenated together to make final input vector of dimension 80.
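The four-part composition can be sketched as four 20-dimensional composition vectors joined into one vector of dimension 80. This is an illustration under stated assumptions: compositions are expressed as fractions, the last part absorbs any leftover residues when the length is not divisible by four, and the function names are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # assumed ordering of the 20 residues

def amino_acid_composition(segment):
    """Return the fraction of each of the 20 residues in a segment."""
    n = len(segment)
    if n == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [segment.count(aa) / n for aa in AMINO_ACIDS]

def four_part_composition(sequence, parts=4):
    """Concatenate the compositions of non-overlapping, near-equal parts."""
    sequence = sequence.upper()
    size = len(sequence) // parts
    vector = []
    for k in range(parts):
        start = k * size
        # Assumption: the last part takes the remainder of the sequence.
        end = (k + 1) * size if k < parts - 1 else len(sequence)
        vector.extend(amino_acid_composition(sequence[start:end]))
    return vector

v = four_part_composition("AAAACCCCDDDDEEEE")
print(len(v))
```

In this toy sequence each quarter consists of a single residue type, so each 20-value block contains one entry equal to 1.0 and zeros elsewhere.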
Project name: DNA-binding proteins prediction;
Project home page: http://www.imtech.res.in/raghava/dnabinder/;
Operating system(s): Platform independent;
Any restriction to use by non-academics: No restriction.
We thank Dr. Xavier Suresh for his help in preparing the data. This work was supported by grants from the Council of Scientific and Industrial Research (CSIR) and the Department of Biotechnology (DBT), Government of India. Manish Kumar is a senior research fellow of CSIR. This research article has IMTech communication number 055/2006.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.