In this study, we collected data from SuperSite documentation and extracted 1061 PDB IDs of proteins in contact with vitamins in the PDB. We downloaded the sequences of all chains of these PDB IDs from the Protein Data Bank. In the next step, we submitted these PDB IDs to the Ligand Protein Contact (LPC) web-server and obtained a total of 2720 chains that interact with vitamins, together with their interacting residues and the positions of those residues. We used a cut-off of 5.0 Å to define vitamin-interacting residues: a residue was considered vitamin-interacting if the closest distance between its atoms and the atoms of the partner vitamin was within the cut-off. A 25% non-redundant dataset of protein chains was created using BLASTCLUST, which finally yielded 187 interacting chains with a total of 3004 vitamin-interacting residues (VIRs); all remaining residues were treated as non-vitamin-interacting residues (non-VIRs). This procedure was repeated to develop the datasets for vitamin A, vitamin B and PLP (vitamin B6-derived) interacting residue prediction, yielding 538, 2207 and 1092 interacting residues in 31, 141 and 71 chains, respectively. Interacting and non-interacting residues were used as positive and negative instances, respectively. Because non-interacting residues far outnumbered interacting residues, we randomly picked 10 times as many non-interacting as interacting residues in order to create a realistic dataset. Balanced datasets were also created, in which a number of negative instances equal to the number of positive instances was drawn at random from the full set of negative window patterns.
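The labelling and sampling rules above can be sketched in a few lines of Python. This is only an illustration, not the LPC web-server's procedure: the function names, the coordinate representation (lists of `(x, y, z)` tuples in Å) and the fixed random seed are assumptions introduced here.

```python
import math
import random

CUTOFF = 5.0  # distance cut-off in Å, as stated in the text


def min_distance(residue_atoms, ligand_atoms):
    """Smallest Euclidean distance between any residue atom and any ligand atom."""
    return min(math.dist(a, b) for a in residue_atoms for b in ligand_atoms)


def label_residues(residues, ligand_atoms, cutoff=CUTOFF):
    """Return 1 (interacting) or 0 (non-interacting) for each residue.

    `residues` is a list of residues, each a list of (x, y, z) atom coordinates.
    """
    return [1 if min_distance(atoms, ligand_atoms) <= cutoff else 0
            for atoms in residues]


def subsample_negatives(positives, negatives, ratio=10, seed=0):
    """Randomly keep at most ratio * len(positives) negative instances.

    ratio=10 mimics the "realistic" dataset; ratio=1 gives a balanced dataset.
    """
    rng = random.Random(seed)  # fixed seed is an assumption, for reproducibility
    k = min(len(negatives), ratio * len(positives))
    return rng.sample(negatives, k)
```

For example, `subsample_negatives(pos, neg, ratio=1)` would produce the negatives of a balanced dataset, while the default `ratio=10` gives the 1:10 positive-to-negative ratio described above.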
We created four independent datasets, V-IND-46, VA-IND-15, VB-IND-27 and PLP-IND-16, containing 46, 15, 27 and 16 protein sequences for the prediction of VIRs, VAIRs, VBIRs and PLPIRs, respectively. All of these datasets were 25% non-redundant, and every sequence in the independent datasets shared less than 25% similarity with the sequences of the main datasets.
Window patterns and size
We generated sliding (overlapping) patterns of 17 residues for each interacting chain sequence. Several previous studies have adopted this strategy for developing interacting-residue prediction tools [40, 45]. If the central residue of a pattern was interacting, we classified the pattern as interacting (positive); otherwise it was termed non-interacting (negative). To generate patterns for the terminal residues of a protein sequence, we added (L-1)/2 dummy residues "X" at both termini of the protein (where L is the pattern length). Here the pattern length is 17, so we added 8 "X" before the N-terminus and 8 "X" after the C-terminus, so that the number of patterns equals the sequence length.
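The windowing scheme can be illustrated with a short Python sketch. The function name `make_windows` and the representation of labels as a list of 0/1 integers are assumptions made for this example.

```python
def make_windows(sequence, labels, window=17, pad="X"):
    """Return one (pattern, label) pair per residue.

    The sequence is padded with (window - 1) / 2 dummy residues on each side,
    so a sequence of length n yields exactly n patterns. The label of a pattern
    is the label of its central residue (1 = interacting, 0 = non-interacting).
    """
    if window % 2 == 0:
        raise ValueError("window length must be odd so that it has a central residue")
    if len(sequence) != len(labels):
        raise ValueError("sequence and labels must have the same length")
    half = (window - 1) // 2
    padded = pad * half + sequence + pad * half
    return [(padded[i:i + window], labels[i]) for i in range(len(sequence))]
```

With the default window of 17, the pattern for the first residue starts with eight "X" characters and has that residue at its centre (index 8).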
Binary profile of patterns
These positive and negative patterns were converted into binary patterns, in which each amino acid was represented by a vector of 21 dimensions (e.g. Ala by 1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0; Cys by 0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0) covering the 20 standard amino acids and one dummy amino acid “X”. We used these profiles as input to the various machine-learning algorithms.
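A minimal sketch of this encoding follows. The text fixes only the first two positions (Ala, Cys); the remaining order (alphabetical by one-letter code, with "X" last) and the mapping of any non-standard letter to "X" are assumptions made for this example. A 17-residue pattern then becomes a vector of 17 × 21 = 357 binary features.

```python
# Assumed residue order: one-letter codes in alphabetical order, dummy "X" last.
ALPHABET = "ACDEFGHIKLMNPQRSTVWYX"


def encode_residue(aa):
    """Return the 21-dimensional one-hot vector for a single residue."""
    vec = [0] * len(ALPHABET)
    # Assumption: letters outside the 20 standard amino acids are treated as "X".
    vec[ALPHABET.index(aa if aa in ALPHABET else "X")] = 1
    return vec


def encode_window(pattern):
    """Concatenate the one-hot vectors of all residues in a pattern."""
    return [bit for aa in pattern for bit in encode_residue(aa)]
```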
Position-Specific Scoring Matrix (PSSM)
We performed a PSI-BLAST (position-specific iterative BLAST) search with default parameters against the non-redundant (NR) database available at Swiss-Prot. After three iterations, PSI-BLAST generated the PSSM profile with the highest score from the multiple alignment of the high-scoring hits by calculating position-specific scores for each position in the alignment. The PSSM profile contains the occurrence probability of every amino acid at each position, along with insertions/deletions, and thus provides evolutionary information for all amino acids. The final PSSM was normalized using a sigmoid function.
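The exact form of the sigmoid is not given in the text. The sketch below assumes the standard logistic function 1 / (1 + e^(-x)), applied element-wise, and also assumes that window positions falling outside the sequence are filled with zeros; both choices, and the function names, are illustrative.

```python
import math


def sigmoid(x):
    """Standard logistic function; maps any real score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def normalize_pssm(pssm):
    """Apply the sigmoid to every score of a PSSM given as a list of rows."""
    return [[sigmoid(v) for v in row] for row in pssm]


def pssm_window_features(norm_pssm, center, window=17):
    """Flatten the normalized PSSM rows of the window centred on `center`.

    Positions before the N-terminus or after the C-terminus are filled with
    zeros (an assumption; the text does not say how terminal rows are padded).
    """
    half = (window - 1) // 2
    width = len(norm_pssm[0])
    features = []
    for pos in range(center - half, center + half + 1):
        if 0 <= pos < len(norm_pssm):
            features.extend(norm_pssm[pos])
        else:
            features.extend([0.0] * width)
    return features
```

With 20 columns per PSSM row, a 17-residue window gives 17 × 20 = 340 features under these assumptions.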
We calculated a surface accessibility value for each residue of all sequences using the SARpred method. We normalized these values to the range between their minimum and maximum and assigned one value to each residue of the 17-residue window patterns. We used these 17 input features for the SVM-based prediction of VIRs, VAIRs, VBIRs and PLPIRs. In the hybrid approach, we combined these 17 input features with the PSSM features.
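As a rough sketch, min-max normalization and the hybrid feature vector could look as follows. Whether the minimum and maximum are taken per sequence or over the whole dataset is not stated, so this example normalizes whatever list it is given; the zero padding at the termini and all function names are likewise assumptions.

```python
def min_max_normalize(values):
    """Scale values linearly so that the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)  # degenerate case: all values identical
    return [(v - lo) / (hi - lo) for v in values]


def accessibility_window(norm_values, center, window=17, pad_value=0.0):
    """One accessibility value per window position (17 features by default)."""
    half = (window - 1) // 2
    return [norm_values[pos] if 0 <= pos < len(norm_values) else pad_value
            for pos in range(center - half, center + half + 1)]


def hybrid_features(accessibility_features, pssm_features):
    """Hybrid input: accessibility features followed by the PSSM features."""
    return list(accessibility_features) + list(pssm_features)
```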
Support vector machine
In this study, we used a highly successful machine-learning technique, the Support Vector Machine (SVM). SVMs are based on the structural risk minimization principle of statistical learning theory and form a set of related supervised learning methods used for classification and regression. The user can choose and optimize a number of parameters and kernels (e.g. linear, polynomial, radial basis function and sigmoidal) or supply a user-defined kernel. We used the SVMlight Version 6.02 package, and machine learning was carried out using three different kernels (linear, polynomial and radial basis function). An SVM takes a set of fixed-length input features together with their known outputs, which are used to train a model. After training, the learned model can be used to predict unknown examples. We optimized the parameters and kernels for all approaches and developed efficient prediction tools.
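SVMlight reads training examples from a text file with one example per line, in the form `<target> <index>:<value> ...`, where feature indices start at 1 and appear in increasing order. The helper below is a hypothetical sketch of how a fixed-length feature vector could be written in that format; omitting zero-valued features is a common space-saving convention, not something the text specifies.

```python
def to_svmlight_line(label, features):
    """Format one example as an SVMlight input line.

    `label` is 1 for a positive (interacting) pattern and anything else for a
    negative one; zero-valued features are skipped (sparse representation).
    """
    target = "+1" if label == 1 else "-1"
    pairs = [f"{i}:{v:g}" for i, v in enumerate(features, start=1) if v != 0]
    return " ".join([target] + pairs)


def write_svmlight_file(path, examples):
    """Write (label, features) pairs to `path`, one example per line."""
    with open(path, "w") as handle:
        for label, features in examples:
            handle.write(to_svmlight_line(label, features) + "\n")
```

The kernel type is then selected when training with SVMlight's `svm_learn` program (its `-t` option); the parameter values used in this study are not reproduced here.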
WEKA is a large collection of machine-learning algorithms in a single package. We used WEKA version 3.6.4, which integrates different classifiers such as BayesNet, NaiveBayes, ComplementNaiveBayes, NaiveBayesMultinomial, RandomForest and IBk. All of these algorithms were applied and optimized during the development of the different prediction tools.
Five-fold cross validation
Validation is an essential part of any prediction method. In this study, we used a five-fold cross-validation technique for training, testing and evaluating our prediction methods. The protein sequences/patterns of positive and negative instances were randomly divided into five sets, each consisting of one-fifth of the positive and one-fifth of the negative instances. Training and testing were carried out five times, each time using one distinct set for testing and the remaining four sets for training.
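The partitioning scheme can be sketched as a stratified split, so that every fold receives about one-fifth of the positives and one-fifth of the negatives. This example splits individual patterns by index; the function names and the fixed seed are assumptions, and the study's own partitioning may differ in detail.

```python
import random


def stratified_folds(labels, k=5, seed=0):
    """Assign example indices to k folds, keeping the class ratio in each fold."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        indices = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin across the folds.
        for j, i in enumerate(indices):
            folds[j % k].append(i)
    return folds


def cross_validation_splits(labels, k=5, seed=0):
    """Yield (train_indices, test_indices); each fold is the test set once."""
    folds = stratified_folds(labels, k, seed)
    for t in range(k):
        train = [i for f in range(k) if f != t for i in folds[f]]
        yield train, folds[t]
```

Each of the five iterations trains on four folds and tests on the remaining one, so every example is used for testing exactly once.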
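To assess the performance of the various modules developed in this study, we calculated the sensitivity, specificity, accuracy and Matthews correlation coefficient (MCC). These measures are routinely used in prediction-based studies of this type. They were calculated using the following equations (1–4):

$$\text{Sensitivity} = \frac{TP}{TP + FN} \tag{1}$$

$$\text{Specificity} = \frac{TN}{TN + FP} \tag{2}$$

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{3}$$

$$\text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{4}$$

where TP and TN are the correctly predicted positive and negative examples, respectively, and FP and FN are the wrongly predicted positive and negative examples, respectively.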
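The four measures follow directly from the confusion counts. The sketch below uses their standard definitions and returns fractions rather than percentages; the function name and the convention of returning 0 when a denominator is zero are assumptions made for this example.

```python
import math


def evaluate(tp, tn, fp, fn):
    """Return sensitivity, specificity, accuracy and MCC from confusion counts."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"sensitivity": sensitivity, "specificity": specificity,
            "accuracy": accuracy, "mcc": mcc}
```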
The standalone version of VitaPred gives prediction results as a probability score instead of the SVM score. We calculated the probability score from the SVM score (Equation 5). Before conversion, SVM scores were clipped to the range −1.5 to 1.5: scores above 1.5 were set to 1.5 and scores below −1.5 were set to −1.5. The probability score ranges from 0 to 9 for each residue of a protein sequence. At the default threshold of 0.0, probability scores of 0–4 are predicted as non-interacting residues and scores of 5–9 as interacting residues.
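One conversion consistent with this description is a linear map from the clipped range [−1.5, 1.5] onto [0, 9], followed by rounding to the nearest integer. This is an assumed reconstruction, not the published equation: the linear form, the round-half-up rule and the function name are all hypothetical, chosen so that an SVM score of 0.0 falls on the boundary between 4 and 5.

```python
import math


def probability_score(svm_score, low=-1.5, high=1.5):
    """Map an SVM score to an integer probability score from 0 to 9.

    Assumed form: clip to [low, high], rescale linearly to [0, 9], then round
    half up so that a score of exactly 0.0 (the default threshold) maps to 5.
    """
    clipped = max(low, min(high, svm_score))
    scaled = (clipped - low) / (high - low) * 9
    return int(math.floor(scaled + 0.5))


def is_interacting(svm_score):
    """Scores of 5-9 are read as interacting, 0-4 as non-interacting."""
    return probability_score(svm_score) >= 5
```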
The five-fold cross-validation technique created five test sets, and performance was calculated for each test set. The final performance of a prediction model is the average over these five test sets; we also calculated the standard error of the performance across the five test sets. MCC is considered to be the most robust parameter for the evaluation of any prediction method. The MCC value ranges from −1 to +1: a value of +1 corresponds to a perfect prediction, 0 corresponds to a completely random prediction, and −1 indicates total disagreement between predicted and actual examples. The evaluation parameters of SVM performance are threshold-dependent and require optimization of parameters and kernels for better results. Complete optimization of all parameters is a key step in SVM-based machine learning. We manually optimized all parameters and selected the best-performing prediction models for the different tasks. For a threshold-independent evaluation of our method, we also created ROC curves and calculated AUC values using the SPSS statistical package.
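The fold averaging and the threshold-independent AUC can be illustrated as follows. This sketch does not reproduce SPSS: it assumes the standard error is the sample standard deviation divided by the square root of the number of folds, and it computes AUC through its rank interpretation (the probability that a random positive scores higher than a random negative, counting ties as one half). The function names are hypothetical.

```python
import math


def mean_and_standard_error(values):
    """Mean and standard error of per-fold performance values."""
    n = len(values)
    mean = sum(values) / n
    if n < 2:
        return mean, 0.0
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)  # sample variance
    return mean, math.sqrt(variance / n)


def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y != 1]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

The pairwise AUC is quadratic in the number of examples, which is fine for illustration but would be replaced by a sort-based computation on large datasets.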
Two sample logo (TSL)
In this study, we created Two Sample Logos (http://www.twosamplelogo.org/) for the graphical representation of positive and negative patterns. Two Sample Logo is a web-based application that calculates and visualizes position-specific differences between positive and negative samples.
A user-friendly web-server, VitaPred, was developed for the prediction of VIRs, VAIRs, VBIRs and PLPIRs in protein sequences. VitaPred is freely available at http://crdd.osdd.net/raghava/vitapred/. It requires a protein sequence in standard FASTA format. Four options are provided, for the prediction of VIRs, VAIRs, VBIRs and PLPIRs. We have also provided our datasets and other supplementary materials that were used for the development of the VitaPred web-server.
Standalone version of VitaPred
In the era of genomics, it is essential to develop computational tools that can handle the huge amount of sequence data. We developed a standalone version of VitaPred using Visual Basic .NET technologies. It is available from the web-server site, and users can download and install it on their own systems. The software reports probability scores (Equation 5) for each residue of the protein sequences. Multiple sequences can be processed efficiently with this software.