PSSM-based prediction of DNA binding sites in proteins
© Ahmad and Sarai; licensee BioMed Central Ltd. 2005
Received: 18 November 2004
Accepted: 19 February 2005
Published: 19 February 2005
Detection of DNA-binding sites in proteins is of enormous interest for technologies targeting gene regulation and manipulation. We have previously shown that a residue and its sequence neighbor information can be used to predict DNA-binding candidates in a protein sequence. This sequence-based prediction method is applicable even if no sequence homology with a previously known DNA-binding protein is observed. Here we implement a neural network based algorithm to utilize evolutionary information of amino acid sequences in terms of their position specific scoring matrices (PSSMs) for a better prediction of DNA-binding sites.
An average of sensitivity and specificity using PSSMs is up to 8.7% better than the prediction with sequence information only. Much smaller data sets could be used to generate PSSM with minimal loss of prediction accuracy.
One problem in using PSSM-derived prediction is obtaining lengthy and time-consuming alignments against large sequence databases. In order to speed up the process of generating PSSMs, we tried to use different reference data sets (sequence space) against which a target protein is scanned for PSI-BLAST iterations. We find that a very small set of proteins can actually be used as such a reference data without losing much of the prediction value. This makes the process of generating PSSMs very rapid and even amenable to be used at a genome level. A web server has been developed to provide these predictions of DNA-binding sites for any new protein from its amino acid sequence.
Online predictions based on this method are available at http://www.netasa.org/dbs-pssm/
There has been a growing interest in the prediction of DNA-binding sites in proteins which play crucial roles in gene regulation [1–4]. We have previously developed a method of predicting DNA-binding sites of proteins from the sequence information . We reported development of a neural network and corresponding web server to predict amino acid residues which are likely to bind DNA. The only input to the neural network in this algorithm was the identity of the amino acid residue and its two sequence neighbors on C- and N- terminals. We also developed a method to identify DNA-binding proteins using electrical moments from structural information of proteins . On the other hand, several investigators have reported that the use of evolutionary information in sequence-based predictions of secondary structure and solvent accessibility can improve the prediction capacity of a neural network [7–10].
Here we report the use of such evolutionary information in improving the prediction of DNA-binding sites of proteins. We note that one of the major problems in applying evolutionary information by way of position specific scoring matrices (PSSMs) for sequence based prediction is that such matrices are generated over large data sets and take a long time to complete. Thus large scale predictions remain inaccessible to moderately capable computers. This is a serious limitation in the portability of neural network based predictions using PSSMs . In this work, we report that evolutionary profiles or PSSMs against much smaller representative reference data sets may be utilized to achieve almost the same levels of prediction as would be obtained from alignments with large sequence data sets representing entire available sequence space. We have used four different reference data sets of PSSMs for 62 representative protein sequences. These are (1) PDNA-RDN: a data set of protein sequences from all Protein-DNA complexes from the PDB, (2) PDNA-NR90: a non-redundant data set compiled from PDNA-RDN, (3) PDB-ALL: a data set of all amino acid sequences from PDB and (4) NCBI-NR: a non-redundant data set of all protein sequences taken from sequence and structure databases and compiled by NCBI (see Methods). We find that the net prediction (an average of sensitivity and specificity) of the best of these systems (using PIR sequence data as reference) improves to 67.1% from the value of 58.4% reported earlier for a sequence-only prediction. We also report that a small reference data set of 375 sequences (PDNA-NR90) can give a 64.6% net prediction – just 2.5% poorer than the best- while reducing the PSSM calculation time from more than two hours (against NCBI-NR) to just about one minute. A better compromise could be the use of PDNA-RDN data for which 65.2% net prediction #150; 1.9% less than the best- was obtained, while about 2 and a half minutes are taken to generate their PSSMs. It is also reported that the presence of redundancy is helpful in improving the prediction whereas presence of data not relevant for DNA-binding may in some cases reduce predictive performance.
Results and discussion
Position Specific Iterative BLAST (PSI BLAST) is a strong measure of residue conservation in a given location. In the absence of any alignments, PSI BLAST simply returns a 20-dimensional vector representing probabilities of conservation against mutations to 20 different amino acids including itself. A matrix consisting of such vector representations for all residues in a given sequence is called Position Specific Scoring Matrix or PSSM. When a residue is conserved through cycles of PSI BLAST, it is likely to be due to a purpose i.e. biological function. It has been established by several authors cited in the introduction that the prediction of structural properties is significantly enhanced by the use of PSSMs compared to predictions based on unique representations of amino acid sequence and its environment. Protein structure universe is vast and a prediction of structural properties should span the entire range of this diversity. However, the question of predicting DNA-binding sites is much narrower and hence the significance of conservation of residues at specific locations may be limited to a subset of this protein space. Such reduction in the protein search space or the reference data sets against which PSSM-based predictions should be attempted is desired for a rapid prediction of binding sites as well as portability of prediction methods. Compact reference data size can not only answer these questions of speed and portability but also try to minimize noise in information contents and improve prediction quality.
Prediction results for binding sites in 62 Proteins with different data sets used for generating PSSM.
Overall Correct predictions (%)
Sensitivity (S1) %
Specificity (S2) %
Net Prediction (S1+S2)/2 %
Sequence only (No PSSM)
PDNA-NR90 375 sequences
PDNA-RDN 1386 sequences
NCBI-NR 1,547,365 sequences
PDB-ALL 47,179 sequences
PIR 283,177 sequences
In terms of CPU time, it may be noted that the time taken by 62 protein sequences used here is about one hour for the best (PIR) data sets. These times are prohibitively large for making predictions at a genomic scale or for providing rapid web services. A compromise could be obtained by using PDNA-RDN instead, which reduces the CPU time by a factor more than 8. The loss of net prediction for this compromise is about 1.9%, which is still 6.8% better than the predictions obtained from sequence information only. PSSMs against this data set for a typical protein of 500 residues can be generated in about 1 s, making it possible to run large scale predictions. A smaller size of reference data and high speed of PSSMs also make this method portable and light weight with a strong predictive ability.
We have provided online predictions based on the above method at our web site . The raw probability scores, their annotations at different sensitivity thresholds, and a reference scale for expected sensitivity and specificity have been provided. In addition, results of sequence alignments obtained after PSI BLAST iterations against a reference data (PDNA-RDN) are also provided. This allows us to have a complete picture of similarity of a given sequence with known DNA-binding proteins and predictions based on neural network using alignment profiles in the form of PSSM. The only input to this neural network is the amino acid sequence of the protein. The web server will automatically generate PSSMs of the given sequence against a reference data and use them as the input to a neural network, trained for predictions of 62 DNA-binding proteins.
A PSSM-based neural network method for predicting DNA-binding sites in proteins has been developed. PSSMs were developed against different data sets and it was observed that significant computer time can be saved by replacing the reference data sets with much smaller reference data sets without loss of much prediction ability. Redundant reference data sets show a better prediction than the non-redundant data sets. A web server was developed to provide prediction of DNA-binding sites based on this method. In addition, the web server provides BLAST alignments against a reference data set of known DNA-binding proteins.
This is a new data set, developed for this work. We have selected all Protein-DNA complexes from PDB and separated their chains. 1386 protein chains were obtained in this way. FASTA formatted sequences were subsequently formatted using formatdb program of the BLAST package .
The data set (PDNA-RDN), obtained from the procedure mentioned above was filtered to remove redundancy at 90% sequence identity level by using sequence clustering program BLASTCLUST . Resulting data set now contains 375 sequences which are formatted for use as a reference data set using formatdb. This data set is called PDNA-NR90.
Other data sets
PDB-ALL (47,189 sequences) is a data set of all protein sequences obtained from NCBI. PIR is the sequence data set (283,177 sequences) of Protein Information Resource at Georgetown University . NCBI-NR is a non-redundant data set of all protein sequences compiled from GeneBank, PIR, SwissProt, PDB and other resources by NCBI .
Generation of PSSMs
Target sequences are scanned against the reference data sets to compile a set of alignment profiles or position specific scoring matrices (PSSMs) using Position Specific Iterative BLAST (PSI BLAST) program . Three cycles of PSI-BLAST were run for each protein and the scores were saved as profile matrices (PSSMs).
Neural network inputs
Network architecture and transfer function
We use a neural network with one hidden layer (two nodes) in addition to the input layer described above and a single node output layer. Large number of units in the hidden layer and additional layers were not tried because the data size does not justify an unreasonably large neural network. Network signal is transferred to subsequent layers by an algebraic summation of inputs from the previous layer. Total signal in the last unit is transformed to a real output by a binary decision function much in the same way as in our previous work except that the input to the network is now replaced by PSSM scores rather than 20 bit binary coding .
Training and validation
A six-fold cross-validation has been used in this work. Out of 62 proteins, 10 were removed at one time and the remaining 52 were trained until the accuracy on the left-out 10 also improved. Six random sets are created in this way and the figures in Table 1 report the averages on all six runs of each set of 10 proteins.
Training error function and measure of prediction quality
Data imbalance in the two binary categories for this neural network makes the choice of error function particularly important. We have used an accuracy score called Net Prediction, which is the average of sensitivity and specificity values defined below. Neural network learns to maximize this accuracy score rather than minimizing an error function.
Sensitivity is defined as the number of correct prediction in the binding category relative to total number of such items in the original data and specificity is the number of correctly rejected residues in this category relative to the total number of non-binding residues in the original data.
Sensitivity (S1) = 100 * TP/(TP+FN) (1)
Specificity (S2) = 100 * TN/(TN+FP) (2)
where TP: True Positive; TN: True Negative; FP: False Positive; FN: False Negative
Relative number of true positive (TP) values in the prediction was termed as accuracy in our previous work. We have avoided using that term here, as we prefer a more operational definition of accuracy measures here. The imbalance of sensitivity (S1) and specificity (S2) is taken care of by comparing the Net Prediction of the models which gives a better comparison when S1 and S2 vary from one sample to the other. Thus,
Net Prediction (NP) = (S1+S2)/2 (3)
List of abbreviations
Position Specific Substitution Matrix
- PSI BLAST:
Position Specific Iterative Basic Local Alignment Search Tool
Corresponding author would like to thank Advanced Technology Institute Inc. for partially supporting this research. This work was supported in part by Grants-in-Aid for Scientific Research 16014219 and 16041235 to A.S.
- Gutfreund MY, Margalit H: Quantitative parameters for amino acid-base interaction: implications for prediction of protein-DNA binding sites. Nucleic Acids Res 1998, 26: 2306–2312. 10.1093/nar/26.10.2306View ArticleGoogle Scholar
- Pabo CO, Nekludova L: Geometric analysis and comparison of protein-DNA interfaces: why is there no simple code for recognition? J Mol Biol 2000, 301: 597–624. 10.1006/jmbi.2000.3918View ArticlePubMedGoogle Scholar
- Luscombe NM, Thornton JM: Protein-DNA Interactions: Amino Acid Conservation and the Effects of Mutations on Binding Specificity. J Mol Biol 2002, 320: 991–1009. 10.1016/S0022-2836(02)00571-5View ArticlePubMedGoogle Scholar
- Stawiski EW, Gregoret LM, Mandel-Gutfreund Y: Annotating Nucleic Acid binding function based on protein structure. J Mol Biol 2003, 326: 1065–1079. 10.1016/S0022-2836(03)00031-7View ArticlePubMedGoogle Scholar
- Ahmad S, Gromiha MM, Sarai A: Analysis and Prediction of DNA-binding proteins and their binding residues based on Composition, Sequence and Structural Information. Bioinformatics 2004, 20: 477–486. 10.1093/bioinformatics/btg432View ArticlePubMedGoogle Scholar
- Ahmad S, Sarai A: Moments based prediction of DNA-binding proteins. J Mol Biol 2004, 341: 65–71. 10.1016/j.jmb.2004.05.058View ArticlePubMedGoogle Scholar
- Rost B, Sander C: Improved prediction of protein secondary structure by using sequence profiles and neural networks. Proc Natl Acad Sci USA 1993, 90: 7558–7562.PubMed CentralView ArticlePubMedGoogle Scholar
- Jones DT: Protein secondary structure prediction based on position specific scoring matrices. J Mol Biol 1999, 292: 195–202. 10.1006/jmbi.1999.3091View ArticlePubMedGoogle Scholar
- Cuff JA, Barton GJ: Application of multiple sequence alignment profiles to improve protein secondary structure prediction. Proteins 2000, 40: 502–11. 10.1002/1097-0134(20000815)40:3<502::AID-PROT170>3.0.CO;2-QView ArticlePubMedGoogle Scholar
- Adamczak R, Porollo A, Meller J: Accurate prediction of solvent accessibility using neural networks based regression. Proteins 2004, 56: 753–767. 10.1002/prot.20176View ArticlePubMedGoogle Scholar
- Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE: The Protein Data Bank. Nucleic Acids Res 2000, 28: 235–242. 10.1093/nar/28.1.235PubMed CentralView ArticlePubMedGoogle Scholar
- Selvaraj S, Kono H, Sarai A: Specificity of Protein-DNA RecognitionRevealed by Structure-based Potentials: Symmetric/Asymmetric and Cognate/Non-cognate Binding. J Mol Biol 2002, 322: 907–915. 10.1016/S0022-2836(02)00846-XView ArticlePubMedGoogle Scholar
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol 1990, 215: 403–410. 10.1006/jmbi.1990.9999View ArticlePubMedGoogle Scholar
- Apweiler R, Bairoch A, Wu CH: Protein sequence databases. Curr Opin Chem Biol 2004, 8: 76–80. 10.1016/j.cbpa.2003.12.004View ArticlePubMedGoogle Scholar
- Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25: 3389–3402. 10.1093/nar/25.17.3389PubMed CentralView ArticlePubMedGoogle Scholar
- DBS-PSSM: Prediction of DNA-binding sites by PSSM and sequence homology[http://www.netasa.org/dbs-pssm/]
- NCBI BLAST databases download web site:[ftp://ftp.ncbi.nlm.nih.gov/blast/db/]
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.