- Open Access
Prediction of FAD interacting residues in a protein from its primary sequence using evolutionary information
BMC Bioinformaticsvolume 11, Article number: S48 (2010)
Flavin binding proteins (FBP) plays a critical role in several biological functions such as electron transport system (ETS). These flavoproteins contain very tightly bound, sometimes covalently, flavin adenine dinucleotide (FAD) or flavin mono nucleotide (FMN). The interaction between flavin nucleotide and amino acids of flavoprotein is essential for their functionality. Thus identification of FAD interacting residues in a FBP is an important step for understanding their function and mechanism.
In this study, we describe models developed for predicting FAD interacting residues using 15, 17 and 19 window pattern. Support vector machine (SVM) based models have been developed using binary pattern of amino acid sequence of protein and achieved maximum accuracy 69.65% with Mathew's Correlation Coefficient (MCC) 0.39 and Area Under Curve (AUC) 0.773. The performance of these models have been improved significantly from 69.65% to 82.86% with MCC 0.66 and AUC 0.904, when evolutionary information is used as input in SVM. The evolutionary information was generated in form of position specific score matrix (PSSM) profile by using PSI-BLAST at e-value 0.001. All models were developed on 198 non-redundant FAD binding protein chains containing 5172 FAD interacting residues and evaluated using fivefold cross-validation technique.
This study suggests that evolutionary information of 17 amino acid patterns perform best for FAD interacting residues prediction. We also developed a web server which predicts FAD interacting residues in a protein which is freely available for academics.
Determining function of a protein is one of the most challenging problems of the post-genomic era. In past various techniques have been developed for predicting the function of proteins using information derived from sequence similarity or clustering patterns of co-regulated genes, interaction of protein etc. It is important to understand interaction of protein with other proteins or ligands in order to understand it function. One of most important ligands among the molecules that interact with proteins is nucleotide. Prediction of proteins and nucleotide interaction can be divided in two categories (I) short nucleotide interaction, where short nucleotide (mono/di/trinucleotide) interacts with proteins (II) polynucleotide interactions, where polynucleotide (DNA/RNA) interacts with proteins.
Many proteins use small nucleotides as a source of energy and signaling molecules inside the cell (adenine and guanine nucleotides). The flavin and nicotinamide nucleotides work as electron donor/acceptors in lots of cellular metabolic reactions. FAD is a redox cofactor involved in several important reactions in metabolism. Living organism mostly generate energy by using glucose or fat molecules, both metabolic pathway regulated by enzyme which prosthetic group is FAD. Thus, identification of FAD interacting residue (FIR) is very important in molecular recognition. Despite tremendous progress in the annotation of protein, there is no any existing online tools are available for the prediction of FIR using primary sequence. Experimental method to identify FIR is very difficult and time consuming process and also very costly. We can easily identify either FAD interact with protein or not by using absorption spectra but can't identify which residues are FIRs.
In the past large number of tools have been developed for the prediction of polynucleotide (DNA/RNA) interacting residues using different machine learning techniques [1–7]. In contrast there has been only one preexisting method available for the prediction of small nucleotide-protein interaction, developed by Saito et al . They developed method for small nucleotide binding site prediction using empirical score approach and detect 40% FAD binding sites correctly. Saito et al. methods only give us idea about binding site but can't give information about the FAD interacting residues. Kallberg et al.  used simple sequence in Hidden Markov Model and developed method for identifying Rossmann folds and predict there coenzyme specificity (NAD, NADP, FAD) and found that FAD least preferred cofactor. So there studies suggest that FAD interacting residues can't predict easily. Thus, the development of computational method for predicting FIR in a protein from its amino acid sequence is very important for functional annotation of proteins.
In this work, a systematic attempt has been made to predict FIRs in a protein sequences using binary pattern and PSSM profiles of 5172 FIRs and non-FIRs of 198 non-redundant protein chains. In first step FBP chain were analyzed, then SVM model were developed by using binary pattern of FIRs. It have been observed in past that evolutionary information provide more information, thus we developed SVM models using PSSM profiles obtained from PSI-BLAST . All models developed in this study were evaluated using five-fold cross validation technique. FADPred can directly predict the FAD interacting residues using binary pattern of sequence and its evolutionary information. Our server will be useful for experimental biologists working on flavoproteins/flavoenzymes.
In first step of data collection we use SuperSite documentation  and extract 675 PDB IDs of protein having contact with FAD in PDB. We download the sequence of all chains of these PDB IDs using web download section in PDB. In next step we use these PDB IDs in Ligand Protein Contact (LPC)  and get total 1539 chain which interacts with FAD with their corresponding interacting residues and its position. Then we remove redundant chains which have more than 40% similarity by using CD-HIT , finally retrieved a total 198 interacting chains with a total 5172 FIRs remaining all residues are non-FIRs. In this study we used 5172 FIRs and 5172 non-FIRs for developing our models. Sequences of these 198 FBP with their PDB ID and chain name are freely available , where FIRs are in lowercase and non-FIRs are in uppercase.
Fivefold cross-validation technique has been used to evaluate the performance of all the models developed in this study. In this technique dataset is randomly divided into five sets where each set consist of nearly equal number of interacting and non-interacting patterns out of these five sets four sets are used for training and the remaining set for testing. This process is repeated five times in such a way that each set is used once for testing. The final performance is obtained by averaging the performance of all the five sets.
Pattern or window size
We generated an overlapping pattern of 17 residues, for each FAD interacting chains sequences. If the central residue of pattern was FIR, then we classified the pattern as positive or FIR pattern, otherwise it was termed as non-interacting or negative pattern. In this study we follow the similar approach adopted by Kaur and Raghava [15–17] for prediction of turns in protein sequences. In additional to 17 residue window we also generate pattern of 15 and 19 residues. In this study we used unique residue patterns for binary and PSSM pattern generation. Finally we have total 4896, 4974 and 4974 unique pattern for interacting residues respectively in 15, 17 and 19 residue window, and randomly picked equal number of non-interacting pattern as negative data.
Support Vector Machine (SVM)
An excellent machine learning technique SVM  has been used for the prediction of FIRs. All SVM models have been developed by using a freely available package SVM_light . The SVM is particularly attractive to biological sequence analysis due to its ability to handle noise, dataset and large input space. Further details about SVM can be obtained from Vapnik's  paper. The software allows users to run SVM using various numbers of parameters as well as to select inbuilt kernel functions, including a linear, polynomial and Radial Basis Function (RBF).
This was obtained from position-specific scoring matrix (PSSM) generated from PSI-BLAST search against non-redundant (nr) database  of protein sequences. The PSSM matrix was generated by three iterations of searching at cutoff e-value of 0.001 for inclusion of sequences in next iteration. The generated PSSM contained the probability of occurrence of each type of amino acid at each position along with insertion/deletion. Hence, PSSM is considered as a measure of residue conservation in a given location. This means that evolutionary information for each amino acid is encapsulated in a vector of 20 dimensions where the size of PSSM matrix of a protein with N residues is 20 × N. Where 20 dimension are 20 standard amino acids. We normalized each value within 0-1 range using equation:
Where val is the PSSM score and Val is its normalized value.
Figure of merits
In this study performance of constructed modules has been evaluated by using five-fold cross-validation techniques. Following threshold dependent parameters: sensitivity (Sn) or percent coverage of FIR is the percentage of FIR residue predicted as FIR; specificity (Sp) or percent coverage of non-interacting residues is the percentage of non-FIR predicted as non-FIR; overall accuracy (Ac) is the percentage of correctly predicted interacting residues has been used for assessing the performance of method. These parameters can be using following equations:
Where TP is correctly predicted FIRs, TN is correctly predicted non-FIRs, FP is the number of non-FIRs predicted as FIR and FN is the number of FIRs wrongly predicted as non-FIR. Matthew's correlation coefficient (MCC) equal to 1 is regarded as a perfect prediction, whereas 0 is for completely random prediction. We also calculated AUC of ROC plot which is a threshold independent parameter.
Description of web server
The prediction method described in this paper is implemented in the form of a web-server FADPred . The common gateway interface (CGI) script of FADPred is written using PERL version 5.03. FADPred server is installed on a Sun Server (420E) under UNIX (Solaris 7) environment. It is a user-friendly web server which allows users to submit their protein sequence in two different ways; first browse and upload the fasta sequence file and second, either type or paste fasta sequence in a box which is available on submit page. This server allows users to predict FAD binding residues using both binary pattern and PSSM based SVM models with different threshold range from -1 - +1. Here we provide option for both binary pattern and PSSM user can select according to their choice and get the result through mail also. The default method is PSSM and threshold is 0.0, sensitivity and specificity is roughly found to equal during the five-fold cross-validation procedure at this threshold. The prediction result presented in graphical form where the predicted FIRs and non-FIRs are displayed in different color. We are using PSSM as default option and it takes several minutes to predict FAD interacting residues in a protein.
We calculated the percentage composition of interacting and non-interacting residues and found Gly, Tyr and Ser residues were more abundantly interact with FAD as compare to non-interacting residues (Figure 1). The dominance of these residues shows a vital role of these residues in FAD interaction.
SVM model using binary pattern of amino acid sequence
In our study we generated multiple 17 residue long window for negative and positive pattern. These sequence patterns were converted to binary patterns, where a pattern of length N was represented by a vector of dimension N × 21 and each amino acid in that pattern was represented by a vector of 21 (e.g. Ala by 1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0) which contained 20 amino acids and one dummy amino acid X. As shown in Table 1, this SVM-based model was able to give a maximum MCC of 0.39 with 69.65% accuracy having minimum difference in sensitivity and specificity. Threshold at which sensitivity and specificity is nearly same is shown by bold font, in order to balance sensitivity and specificity. Similarly we achieved accuracy 70.31% with MCC of 0.41 for 15 window patterns and accuracy 70.49 with MCC of 0.41 for 19 window patterns. AUCs are 0.769, 0.773 & 0.770 respectively for 15 window, 17 window and 19 window pattern models (Table 2). The performances of models were evaluated at residue level.
SVM model using evolutionary information
In the past, it has been shown in several studies that evolutionary information obtained using multiple sequence alignment provides more comprehensive information about the protein than a single sequence [1, 6, 15–17]. In this study, the evolutionary information obtained from PSSM generated from PSI-BLAST profiles was used for predicting FIRs. As shown in Table 3, performance increased significantly when PSSM was used as input instead of single sequence. A maximum MCC of 0.62 was achieved with 80.82% accuracy using evolutionary information. Similarly we achieved accuracy 80.29 with MCC of 0.61 for 15 window patterns and accuracy 80.39 with MCC of 0.61 for 19 window patterns. AUCs are 0.878, 0.904 & 0.876 respectively for 15 window, 17 window and 19 window pattern models (Table 2).
Due to the vital roles of FAD binding proteins in cellular metabolism and difficulties in iv-vitro analysis or identification of FIRs, by biophysical method, there is as urgent need for computational method to identify FAD binding sites on the basis of amino acid sequence of a protein. In this direction, we had followed a systematic attempt to develop a highly accurate and robust method for predicting FAD binding residues in protein sequences. There is no any preexisting online method in our knowledge for the prediction of FIRs using primary sequence. So first of all we developed method for predicting FIRs using sequence of FBP proteins. For this study firstly we collect the information of FAD binding proteins PDB IDs with SuperSite, fasta sequence from PDB and FAD interacting residues using LPC. Then analyze FIRs and found that there is significant difference in interacting residues as well as flanking residues.
It has been reported in some of the earlier studies that SVM perform better than any other artificial intelligence (AI) techniques in interacting residue prediction. First we developed SVM model based on binary patterns of amino acid sequence. Manish et al. 2008 showed that evolutionary information perform better than simple sequence information in interacting residue prediction. Further we used evolutionary information to generate PSSM profile as input for SVM model and check overall performance of FIRs prediction. SVM parameter for each model with their AUC is given in Table 2. One of the obvious questions is why we can't use BLAST for predicting FIRs. Thus we also make an attempt to predict FAD interacting residues using BLAST and achieved very poor performance (data not shown). We also provide a direct access of our developed prediction method to public, through web server FADPred. FADPred allow users to predict FAD interacting residues in their protein sequence.
In this study first, time a highly robust method has been developed to predict FAD interacting residues from protein sequence using AI technique, SVM. This study demonstrates that PSSM based method performs better than simple sequence based method. In this study we also observed that 17 window pattern perform better than 15 and 19 window pattern (Table 2, Figure 2). This study will be helpful for biologist in proteome annotation. One of the major advantage of this study is that we developed free web server; FADPred. Our web server allows users to identify FAD interacting residue in given sequence using the model trained on our data set.
Kumar M, Gromiha MM, Raghava GPS: Prediction of RNA binding sites in the protein using SVM and PSSM profile. Proteins 2008, 71: 189–194. 10.1002/prot.21677
Jeong E, Chung IF, Miyano S: A neural network method for identification of RNA-interacting residues in protein. Genome Inform 2004, 15: 105–116.
Jeong E, Miyano SA: Weighted profile based method for protein-RNA interacting residue prediction. In Lecture notes in computer science. Volume 3939. Edited by: Corrado P, Luca C, Stephen E. Berlin/Heidelberg: Springer; 2006:123–139. full_text
Bhardwaj N, Lu H: Residue-level prediction of DNA-binding sites and its application on DNA-binding proteins. FEBS Lett 2007, 581: 1058–1066. 10.1016/j.febslet.2007.01.086
Ofran Y, Mysore V, Rost B: Prediction of DNA-binding residues from sequence. Bioinformatics 2007, 23: i347–353. 10.1093/bioinformatics/btm174
Kuznetsov IB, Gou Z, Li R, Hwang S: Using evolutionary and structural information to predict DNA-binding sites on DNA-binding proteins. Proteins 2006, 64: 19–27. 10.1002/prot.20977
Ahmad S, Gromiha MM, Sarai A: Analysis and prediction of DNA-binding proteins and their binding residues based on composition, sequence and structural information. Bioinformatics 2004, 20: 477–486. 10.1093/bioinformatics/btg432
Saito M, Go M, Shirai T: An empirical approach for detecting nucleotide-binding sites on proteins. Protein Engineering Design Selection 2006, 19: 67–75. 10.1093/protein/gzj002
Korllberg Y, Persson B: Prediction of coenzyme specificity in dehydrogenases/reducatases. A hidden Markov model-based method and its application on complete genomes. FEBS Journal 2006, 273: 1177–1184. 10.1111/j.1742-4658.2006.05153.x
Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25: 3389–3402. 10.1093/nar/25.17.3389
Bauer RA, Günther S, Heeger C, Jansen D, Thaben P, Preissner R: SuperSite: Dictionary of metabolite and drug binding sites in proteins. Nucleic Acids Res 2008, 37: D195–200. 10.1093/nar/gkn618
Sobolev V, Sorokine A, Prilusky J, Abola EE, Edelman M: Automated analysis of interatomic contacts in proteins. Bioinformatics 1999, 15: 327–332. 10.1093/bioinformatics/15.4.327
Li W, Godzic A: Cd-hit: a fast program for clustering and computing large sets of protein or nucleotide sequences. Bioinformatics 2006, 22: 1658–1659. 10.1093/bioinformatics/btl158
Kaur H, Raghava GPS: Prediction of α-turns in proteins using PSI-BLAST profiles and secondary structure information. Proteins 2004, 55: 83–90. 10.1002/prot.10569
Kaur H, Raghava GPS: Prediction of β-turns in proteins from multiple alignment using neural network. Protein Sci 2003, 12: 627–634. 10.1110/ps.0228903
Kaur H, Raghava GPS: A neural-network based method for prediction of γ-turns in proteins from multiple sequence alignment. Protein Sci 2003, 12: 923–929. 10.1110/ps.0241703
Joachims T: Making large scale SVM learning practical. In Advances in kernel methods:Support Vector Learning. Edited by: Scholkopf B, Burges C, Smola A. Cambridge: MIT Press; 1999:169–184.
Vapnik V: The nature of statistical learning theory. New York: Springer; 1995.
The author's are thankful to the Council of Scientific and Industrial Research (CSIR) and Department of Biotechnology (DBT), Government of India for financial assistance. Nitish Kumar Mishra is a senior research fellow and financially supported by CSIR. This research article has IMTech communication number 019/2009.
This article has been published as part of BMC Bioinformatics Volume 11 Supplement 1, 2010: Selected articles from the Eighth Asia-Pacific Bioinformatics Conference (APBC 2010). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/11?issue=S1.
The authors declare that they have no competing interests.
Mishra NK carried out the data analysis and interpretation, developed computer programs, wrote the manuscript and developed the web server and created clean datasets. GPSR conceived and coordinated the project, guided its conception and design, helped in interpretation of data, refined the drafted manuscript and gave overall supervision to the project. All authors read and approved the final manuscript.