Dataset
In the first step of data collection, we used the SuperSite documentation [11] to extract 675 PDB IDs of proteins that are in contact with FAD in the PDB. We downloaded the sequences of all chains of these PDB entries using the download section of the PDB website. In the next step, we submitted these PDB IDs to Ligand Protein Contact (LPC) [12] and obtained a total of 1539 chains that interact with FAD, together with their interacting residues and the positions of those residues. We then removed redundant chains sharing more than 40% sequence similarity using CD-HIT [13], which left 198 FAD-interacting chains containing a total of 5172 FAD interacting residues (FIRs); all remaining residues were treated as non-FIRs. In this study we used 5172 FIRs and 5172 non-FIRs for developing our models. The sequences of these 198 FAD binding proteins (FBPs), with their PDB IDs and chain names, are freely available [14]; in these sequences FIRs are shown in lowercase and non-FIRs in uppercase.
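The lowercase/uppercase convention of the released sequences can be turned into per-residue labels with a few lines of code. The following is a minimal sketch; the function name and the toy chain are hypothetical and are not taken from the dataset.

```python
def labels_from_case(annotated_seq):
    """Return (sequence, labels) for a case-encoded chain.

    Lowercase letters mark FIRs (label 1); uppercase letters mark
    non-FIRs (label 0). The returned sequence is fully uppercase.
    """
    labels = [1 if residue.islower() else 0 for residue in annotated_seq]
    return annotated_seq.upper(), labels


# Hypothetical toy chain, not taken from the dataset.
sequence, labels = labels_from_case("MKtGaGVLsA")
print(sequence)     # MKTGAGVLSA
print(labels)       # [0, 0, 1, 0, 1, 0, 0, 0, 1, 0]
print(sum(labels))  # 3
```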
Five-fold cross-validation
Five-fold cross-validation was used to evaluate the performance of all models developed in this study. In this technique the dataset is randomly divided into five sets, each containing a nearly equal number of interacting and non-interacting patterns. Four of the five sets are used for training and the remaining set is used for testing. This process is repeated five times so that each set is used exactly once for testing. The final performance is obtained by averaging the performance over all five sets.
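The splitting and averaging procedure can be sketched as follows. This is an illustrative implementation only: the function names, the fixed random seed and the round-robin assignment of patterns to sets are assumptions, and `evaluate` stands in for whatever routine trains and tests a model.

```python
import random


def stratified_folds(labels, n_folds=5, seed=0):
    """Assign pattern indices to folds, balancing both classes."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Deal shuffled indices round-robin so that, within each class,
        # fold sizes differ by at most one.
        for k, i in enumerate(idx):
            folds[k % n_folds].append(i)
    return folds


def cross_validate(labels, evaluate, n_folds=5, seed=0):
    """Average the score returned by evaluate(train_idx, test_idx) over folds."""
    folds = stratified_folds(labels, n_folds, seed)
    scores = []
    for k, test_idx in enumerate(folds):
        # Every fold except the k-th one is used for training.
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        scores.append(evaluate(train_idx, test_idx))
    return sum(scores) / n_folds


# Toy data: 10 interacting and 10 non-interacting patterns.
toy_labels = [1] * 10 + [0] * 10
toy_folds = stratified_folds(toy_labels)
print([len(fold) for fold in toy_folds])  # [4, 4, 4, 4, 4]
```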
Pattern or window size
We generated overlapping patterns (windows) of 17 residues for each FAD-interacting chain sequence. If the central residue of a pattern was an FIR, the pattern was classified as a positive (FIR) pattern; otherwise it was termed a non-interacting (negative) pattern. Here we followed an approach similar to that adopted by Kaur and Raghava [15–17] for the prediction of turns in protein sequences. In addition to the 17-residue window, we also generated patterns of 15 and 19 residues. Only unique residue patterns were used for generating binary and PSSM patterns. This gave a total of 4896, 4974 and 4974 unique interacting patterns for window sizes of 15, 17 and 19 residues, respectively, and an equal number of non-interacting patterns was randomly picked as negative data.
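Pattern generation can be illustrated with a short sketch. Padding the chain termini with a dummy residue "X", so that every residue can sit at the centre of a window, is an assumption of this sketch, as are the function names and the toy sequence.

```python
def make_patterns(sequence, labels, window=17, pad="X"):
    """Return (pattern, label) pairs, one per residue.

    The label of a pattern is the label of its central residue.
    Padding the termini with `pad` is an assumption of this sketch.
    """
    if window % 2 == 0:
        raise ValueError("window must be odd so that it has a central residue")
    half = window // 2
    padded = pad * half + sequence + pad * half
    return [(padded[i:i + window], labels[i]) for i in range(len(sequence))]


def unique_patterns(pairs):
    """Drop repeated patterns, keeping the first occurrence of each."""
    seen = set()
    unique = []
    for pattern, label in pairs:
        if pattern not in seen:
            seen.add(pattern)
            unique.append((pattern, label))
    return unique


# Toy example with a 5-residue window for readability.
pairs = make_patterns("MKTGAGVLSA", [0, 0, 1, 0, 1, 0, 0, 0, 1, 0], window=5)
print(pairs[0])  # ('XXMKT', 0)
print(pairs[2])  # ('MKTGA', 1)
```

Keeping the first occurrence of a repeated pattern is a simplification; identical patterns with conflicting labels would need an explicit policy.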
Support Vector Machine (SVM)
SVM [18], a well-established machine learning technique, was used for the prediction of FIRs. All SVM models were developed using the freely available package SVM_light [19]. SVMs are particularly attractive for biological sequence analysis because of their ability to handle noise, large datasets and large input spaces. Further details about SVM can be found in Vapnik's paper [20]. The software allows users to set a number of parameters and to select among built-in kernel functions, including linear, polynomial and radial basis function (RBF) kernels.
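SVM_light reads training examples from a text file with one example per line, in the sparse form `<target> <index>:<value> ...` with feature indices starting at 1 and in increasing order. The sketch below shows how a residue pattern might be written in that form. The 21-letter alphabet (20 standard residues plus a dummy "X") and the helper names are assumptions made for illustration, not details confirmed by the text.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWYX"  # 20 standard residues plus dummy X (assumed)


def binary_features(pattern):
    """One-hot encode a pattern: 21 values per residue position."""
    vector = []
    for residue in pattern:
        one_hot = [0] * len(ALPHABET)
        # Any unexpected symbol is mapped to the dummy residue X.
        index = ALPHABET.find(residue)
        one_hot[index if index >= 0 else ALPHABET.index("X")] = 1
        vector.extend(one_hot)
    return vector


def svmlight_line(label, vector):
    """Format one example as '<target> <index>:<value> ...' with 1-based indices."""
    target = "+1" if label == 1 else "-1"
    # Zero-valued features are omitted, which the sparse format allows.
    features = " ".join(f"{i}:{v:g}" for i, v in enumerate(vector, start=1) if v != 0)
    return f"{target} {features}"


print(svmlight_line(1, binary_features("AXC")))  # +1 1:1 42:1 44:1
```

A file of such lines would then be passed to the package's `svm_learn` program for training and to `svm_classify` for prediction, with the kernel and its parameters chosen through command-line options.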
Evolutionary information
Evolutionary information was obtained from the position-specific scoring matrix (PSSM) generated by a PSI-BLAST search against the non-redundant (nr) database [21] of protein sequences. The PSSM was generated using three iterations of searching with an e-value cutoff of 0.001 for inclusion of sequences in the next iteration. The generated PSSM contains the probability of occurrence of each type of amino acid at each position, along with insertions/deletions. Hence, the PSSM is considered a measure of residue conservation at a given position. The evolutionary information for each residue is thus encapsulated in a vector of 20 dimensions, one for each of the 20 standard amino acids, so the PSSM of a protein with N residues has size 20 × N. We normalized each value to the 0-1 range using the following equation:
Where val is the PSSM score and Val is its normalized value.
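The following sketch shows how normalized PSSM values might be assembled into a feature vector for one window. The logistic function used here is only one common way of mapping a score into the 0-1 range and is an assumption of this sketch, not necessarily the normalization used in this study; the zero-filling of positions beyond the chain termini and the function names are likewise assumptions.

```python
import math


def normalize(val):
    """Map a raw PSSM score to the 0-1 range (logistic function, assumed)."""
    return 1.0 / (1.0 + math.exp(-val))


def pssm_window_features(pssm, center, window=17):
    """Flatten the normalized PSSM rows around `center` into one vector.

    `pssm` is a list of N rows with 20 scores each. Positions that fall
    outside the chain are filled with zeros (an assumption of this sketch).
    """
    half = window // 2
    features = []
    for pos in range(center - half, center + half + 1):
        if 0 <= pos < len(pssm):
            features.extend(normalize(score) for score in pssm[pos])
        else:
            features.extend([0.0] * 20)
    return features


# Hypothetical 3-residue PSSM with made-up scores.
toy_pssm = [[0] * 20, [2] * 20, [-3] * 20]
vector = pssm_window_features(toy_pssm, center=0, window=3)
print(len(vector))  # 60
print(vector[20])   # 0.5
```

With a 17-residue window this yields 17 × 20 = 340 values per pattern.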
Figures of merit
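In this study the performance of the constructed models was evaluated using five-fold cross-validation. The following threshold-dependent parameters were used to assess the performance of the method: sensitivity (Sn), or percent coverage of FIRs, is the percentage of FIRs predicted as FIRs; specificity (Sp), or percent coverage of non-interacting residues, is the percentage of non-FIRs predicted as non-FIRs; and overall accuracy (Ac) is the percentage of correctly predicted residues (both FIRs and non-FIRs). These parameters, together with the Matthews correlation coefficient (MCC), can be calculated using the following equations:
$$\mathrm{Sn} = \frac{TP}{TP + FN} \times 100$$
$$\mathrm{Sp} = \frac{TN}{TN + FP} \times 100$$
$$\mathrm{Ac} = \frac{TP + TN}{TP + TN + FP + FN} \times 100$$
$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$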
Where TP is the number of correctly predicted FIRs, TN is the number of correctly predicted non-FIRs, FP is the number of non-FIRs wrongly predicted as FIRs, and FN is the number of FIRs wrongly predicted as non-FIRs. A Matthews correlation coefficient (MCC) of 1 is regarded as a perfect prediction, whereas 0 corresponds to a completely random prediction. We also calculated the area under the curve (AUC) of the receiver operating characteristic (ROC) plot, which is a threshold-independent parameter.
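For concreteness, these measures can be computed from the four counts as in the sketch below, which uses the standard definitions of sensitivity, specificity, accuracy and MCC. The AUC is computed here as the fraction of (FIR, non-FIR) pairs in which the FIR receives the higher score, which is equivalent to the area under the ROC curve; this pairwise form is chosen for brevity and suits small sets only. Function names and the example numbers are made up for illustration.

```python
import math


def threshold_metrics(tp, tn, fp, fn):
    """Return sensitivity, specificity, accuracy (all in %) and MCC."""
    sn = 100.0 * tp / (tp + fn)
    sp = 100.0 * tn / (tn + fp)
    ac = 100.0 * (tp + tn) / (tp + tn + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is reported as 0 when it is undefined (zero denominator).
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sn, sp, ac, mcc


def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # Ties between a positive and a negative score count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up counts and scores for illustration only.
sn, sp, ac, mcc = threshold_metrics(tp=80, tn=70, fp=30, fn=20)
print(sn, sp, ac)                                # 80.0 70.0 75.0
print(auc([0.9, 0.4, 0.6, -0.2], [1, 1, 0, 0]))  # 0.75
```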
Description of web server
The prediction method described in this paper is implemented as a web server, FADPred [22]. The common gateway interface (CGI) scripts of FADPred are written in Perl version 5.03. The FADPred server is installed on a Sun server (420E) running UNIX (Solaris 7). It is a user-friendly web server that allows users to submit their protein sequence in two ways: by browsing for and uploading a FASTA sequence file, or by typing or pasting a FASTA sequence into the box provided on the submission page. The server allows users to predict FAD binding residues using either binary-pattern-based or PSSM-based SVM models, with a threshold ranging from -1 to +1. Users can select either method according to their choice and can also receive the results by e-mail. The default method is PSSM and the default threshold is 0.0; at this threshold, sensitivity and specificity were found to be roughly equal during five-fold cross-validation. The prediction results are presented in graphical form, with predicted FIRs and non-FIRs displayed in different colors. With the default PSSM option, it takes several minutes to predict the FAD interacting residues in a protein.
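The effect of the threshold can be illustrated with a small sketch that converts per-residue SVM scores into a prediction string. This is not the server's code: the function name is hypothetical, the use of a strict "greater than" comparison is an assumption, and predicted FIRs are shown in lowercase here instead of in color.

```python
def apply_threshold(sequence, scores, threshold=0.0):
    """Mark residues whose SVM score exceeds `threshold` as predicted FIRs.

    Predicted FIRs are returned in lowercase and non-FIRs in uppercase,
    mirroring the dataset convention rather than the server's colored output.
    """
    if len(sequence) != len(scores):
        raise ValueError("one score per residue is required")
    return "".join(
        res.lower() if score > threshold else res.upper()
        for res, score in zip(sequence, scores)
    )


# Made-up scores for a toy sequence.
scores = [-0.8, 0.3, 0.9, -0.1, 0.05]
print(apply_threshold("MKTGA", scores))                 # MktGa
print(apply_threshold("MKTGA", scores, threshold=0.5))  # MKtGA
```

Raising the threshold predicts fewer residues as FIRs, which increases specificity at the cost of sensitivity.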