Prediction of RNA-binding amino acids from protein and RNA sequences

Background: Many learning approaches to predicting RNA-binding residues in a protein sequence construct a non-redundant training dataset based on sequence similarity. The sequence similarity-based method either keeps a whole sequence or discards it from the training dataset. However, similar or even identical sequences can have different interaction sites depending on their interaction partners, and this information is lost when the sequences are removed. Furthermore, a training dataset constructed by the sequence similarity-based method may still contain redundant data when a remaining sequence contains similar subsequences. In addition to the problem with the training dataset, most approaches do not consider the interacting partner (i.e., RNA) of a protein when they predict RNA-binding amino acids. Thus, they always predict the same RNA-binding sites for a given protein sequence even if the protein binds to different RNA molecules.

Results: We developed a feature vector-based method that removes data redundancy to construct a non-redundant training dataset. The feature vector-based method constructed a larger training dataset than the standard sequence similarity-based method, yet the dataset contained no redundant data. We identified effective features of protein and RNA (the interaction propensity of amino acid triplets, global features of the protein sequence, and an RNA feature) for predicting RNA-binding residues. Using the method and features, we built a support vector machine (SVM) model that predicted RNA-binding residues in a protein sequence. Our SVM model showed an accuracy of 84.2%, an F-measure of 76.1%, and a correlation coefficient of 0.41 with 5-fold cross validation on a non-redundant dataset from 3,149 protein-RNA interacting pairs. On an independent test dataset that does not include the 3,149 pairs and was not used in training the SVM model, it achieved an accuracy of 90.3%, an F-measure of 72.8%, and a correlation coefficient of 0.24. Comparison with other methods on the same datasets demonstrated that our model was better than the others.

Conclusions: The feature vector-based redundancy reduction method is powerful for constructing a non-redundant training dataset for a learning model since it generates a larger dataset with non-redundant data than the standard sequence similarity-based method. Including the features of both RNA and protein sequences in a feature vector results in better performance than using the protein features only when predicting the RNA-binding residues in a protein sequence.


Background
Interactions between proteins and RNA are fundamental to many cellular processes [1]. Much experimental and theoretical effort has been made to study protein-RNA interactions, but their precise mechanism is not fully understood. Motivated by the recent increase in structures of protein-RNA complexes, several theoretical studies, including supervised learning approaches, have been carried out to predict RNA-binding residues in protein sequences. For example, BindN [2] uses a support vector machine (SVM) to predict the RNA- or DNA-binding residues in a protein sequence based on the chemical properties of amino acids. RNABindR [3,4] predicts the RNA-binding residues in a protein sequence using a Naïve Bayes classifier. However, none of these methods considers the interacting partner (i.e., RNA) of a given protein when predicting RNA-binding amino acids. Thus, they always predict the same RNA-binding sites for a given protein sequence even if the protein binds to different RNA molecules.
We previously studied the interactions between protein and RNA [5,6]. In an effort to discover binding-specific features of amino acids and nucleotides, we performed an extensive analysis of the recent structures of protein-RNA complexes and computed several types of interaction propensity (IP) between amino acids and nucleotides [7]. Our analysis revealed that the IP of amino acid triplets has a higher binding specificity than IP of individual amino acids or other biochemical properties. In this study, we modified the previous interaction propensity of amino acid triplets and computed the new interaction propensity from more structures of protein-RNA complexes. In addition to the interaction propensity, we identified several features of protein and RNA sequences which are effective for predicting RNA-binding amino acids in a protein sequence.
In supervised learning approaches, preparing enough training data is crucial for success, but the training data should be non-redundant and representative. Many learning approaches to predicting RNA-binding residues construct a training dataset based on the similarity of protein sequences without considering their binding and partial sequence information [2-4,8-10]. However, these approaches eliminate much binding information from a training dataset, which would otherwise be valuable for predicting binding sites. Consider the protein sequences of Figure 1, which were grouped by CD-HIT [11] based on sequence similarity. CD-HIT selects protein chain A of the protein-RNA complex 1F7Y (1F7Y:A) as the representative sequence of the cluster since it is the longest sequence in the cluster. For a small training dataset, removing all the sequences except the representative one can make the training dataset impractically small. More importantly, the binding information of the removed sequences is lost during the redundancy reduction. On the other hand, including all the redundant sequences of the cluster in a training dataset would yield a classifier prone to over-fitting due to exposure to a highly redundant training dataset.
In the standard redundancy reduction practice for sequence data, a sequence is either included in a training dataset as a whole or discarded from it. It is an all-or-nothing approach, so no partial sequence of the original sequence can be included in a training dataset. As a solution, we propose a feature vector-based reduction of data redundancy. In the feature vector-based approach, two feature vectors with identical elements but different binding labels (i.e., one is binding and the other is non-binding) are considered different feature vectors and both are included in a training dataset. The feature vector-based redundancy reduction method constructs a powerful training dataset, especially when sufficient data are not available.
This paper describes a support vector machine (SVM) model that predicts RNA-binding residues by considering both RNA and protein sequences. The SVM model was trained on a non-redundant dataset constructed by the feature vector-based redundancy reduction method. To the best of our knowledge, this is the first attempt to predict RNA-binding amino acids by considering the RNA sequence that interacts with the protein. The rest of the paper presents the details of the SVM model and its experimental results.

Definition of protein-RNA binding sites
Different studies have used slightly different criteria for protein-RNA binding sites. For example, in BindN [2] and RNABindR [3,4] an amino acid with an atom within a distance of 5 Å from any atom of a nucleotide was considered to be an RNA-binding amino acid. We use stricter criteria. In our study, an RNA-binding amino acid should satisfy the following geometric criteria for hydrogen bonding (H bond) interaction with RNA: a donor-acceptor (D-A) distance < 3.9 Å, a hydrogen-acceptor (H-A) distance < 2.5 Å, a donor-hydrogen-acceptor (D-H-A) angle > 90°, and a hydrogen-acceptor-acceptor antecedent (H-A-AA) angle > 90°.
These criteria are slightly different from the ones used in our previous studies [6,12] but they are the most commonly used criteria for H bonds. In particular, the criteria of an H-A distance < 2.5 Å and a D-H-A angle > 90° are essential for H bonds [13].
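The four geometric criteria above can be expressed as a simple predicate. The following is a minimal sketch, not the authors' program: the function names and the tuple layout of a contact are hypothetical, and treating a residue as binding when at least one of its contacts with RNA qualifies is an assumption made for illustration.

```python
def is_hbond(d_a_dist, h_a_dist, dha_angle, haaa_angle):
    """Return True if a contact meets all four geometric H-bond criteria.

    Distances are in angstroms and angles in degrees.
    """
    return (
        d_a_dist < 3.9          # donor-acceptor (D-A) distance
        and h_a_dist < 2.5      # hydrogen-acceptor (H-A) distance
        and dha_angle > 90.0    # donor-hydrogen-acceptor (D-H-A) angle
        and haaa_angle > 90.0   # hydrogen-acceptor-acceptor antecedent angle
    )


def is_binding_residue(contacts):
    """Assumption: a residue is RNA-binding if any contact is an H bond.

    `contacts` is a hypothetical list of
    (d_a_dist, h_a_dist, dha_angle, haaa_angle) tuples.
    """
    return any(is_hbond(*contact) for contact in contacts)
```

A contact with an H-A distance of 2.6 Å is rejected here even if the other three criteria hold, which is what makes these criteria stricter than a plain 5 Å distance cutoff.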

Interaction propensity of amino acid triplets
The same amino acid can have different interaction propensities with different neighbors or in different secondary structures. We computed the interaction propensity of three consecutive amino acids in a sequence (called an amino acid triplet) and used the interaction propensity to predict RNA-binding residues. The interaction propensity $IP_{tb}$ between the amino acid triplet t and the nucleotide b is defined by equation (1).
This definition of $IP_{tb}$ in equation (1) differs slightly from the definition we used in our previous studies [5-7]: (1) $IP_{tb}$ uses the inverse of the product of the H-A distance and the cosine of the D-A-H angle of the H bonds between an amino acid in a triplet and a nucleotide, instead of using the H bonds between an amino acid and a nucleotide, and (2) $IP_{tb}$ uses amino acid triplets instead of individual amino acids.
In equation (1), $\sum 1/(HA \cos\angle DAH)_{tb}$ is the sum of the inverse projected distances of H-A on D-A between triplet t and the binding nucleotide b, $N_{PR}$ is the total number of amino acid triplets that bind to any nucleotide, $N_t$ is the number of triplets t, $N_P$ is the total number of amino acid triplets, $N_b$ is the number of nucleotides b, and $N_R$ is the total number of nucleotides in the dataset. The purpose of using the projected distance of H-A on D-A in the H bonds is to consider the H-A distance as well as its locational relationship to D-A. The purpose of using the inverse value is to assign higher IP values to H bonds with close donor-acceptor pairs than to those with distant donor-acceptor pairs. Since there are $20^3$ = 8,000 amino acid triplets and 4 nucleotides, we computed 32,000 IPs between amino acid triplets and nucleotides.
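The quantities defined above suggest a propensity of the familiar observed-over-expected form. The sketch below is written under that assumption only: the arrangement of the terms (the summed inverse projected distance normalized by N_PR, divided by the product of the triplet and nucleotide frequencies) is a guess that is consistent with the definitions, and all function and parameter names are hypothetical.

```python
import math


def inverse_projected_distance(h_a_dist, dah_angle_deg):
    """Inverse of the H-A distance projected onto the D-A direction."""
    return 1.0 / (h_a_dist * math.cos(math.radians(dah_angle_deg)))


def interaction_propensity(inv_proj_sum, n_pr, n_t, n_p, n_b, n_r):
    """Assumed form: (sum / N_PR) / ((N_t / N_P) * (N_b / N_R)).

    inv_proj_sum: summed inverse projected distances for triplet t and base b
    n_pr: number of triplets that bind to any nucleotide
    n_t, n_p: count of triplet t and total number of triplets
    n_b, n_r: count of nucleotide b and total number of nucleotides
    """
    if n_t == 0 or n_b == 0:
        return 0.0  # a triplet or base that never occurs gets zero propensity
    observed = inv_proj_sum / n_pr
    expected = (n_t / n_p) * (n_b / n_r)
    return observed / expected


# 20 amino acids give 20**3 triplets; with 4 nucleotides that is 32,000 IPs.
NUM_IPS = 20 ** 3 * 4
```

Under this form, a triplet that is rare in the dataset but frequently forms short, well-aligned H bonds with a given nucleotide receives a high value.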

Encoding a feature vector
To predict RNA-binding amino acids in a protein sequence, we represented several features of protein and RNA sequences in a feature vector. The features can be categorized into three different feature types: (1) global features of the protein sequence, (2) local features of amino acids, and (3) partner features. The global features represent information about the entire sequence that contains the target residue, whereas the local features represent information about the individual residue. Since our prediction model is intended to predict different binding sites in a protein sequence depending on the partner sequence, the information of the interacting partner sequence was used along with the other features.

Figure 1 The binding information loss during redundancy reduction based on sequence similarity. The protein sequences were clustered by CD-HIT. Protein chain A of the protein-RNA complex 1F7Y (1F7Y:A) was selected as the representative sequence because it was the longest. In the representative sequence, the boxed residues were determined to be non-binding residues, but the residues at similar locations in the non-selected protein sequences were determined to be binding residues. Hence, the binding information of the non-selected protein sequences is not contained in the training data, which include only the binding information of the selected sequences.

• Global features of the protein sequence included the sequence length (L) and amino acid composition (C). The amino acid composition represented the frequencies of 20 amino acids in a protein sequence. The global features required a total of 21 elements in a feature vector, one for the sequence length and 20 for the amino acid composition.
• Local features of amino acids included the normalized position (N), hydropathy (H), accessible surface area (A), molecular mass (M), and side chain pKa (P) of an amino acid, and the interaction propensity (IP) of an amino acid triplet. IP is represented as 4 elements, IP_A, IP_C, IP_G, and IP_U, in which IP_A denotes the interaction propensity of the amino acid triplet with the nucleotide adenine (A) (Figure 2). The normalized position of an amino acid in the sequence is calculated by equation (2). Except for the normalized position, the same amino acid or amino acid triplet always has the same values for the local features.

$$\text{Normalized Position} = \frac{\text{Position of the amino acid in the sequence}}{\text{Sequence length}} \tag{2}$$

• Partner features represent the feature of the RNA (R) sequence that interacts with the protein. For each of the four nucleotides, we encoded the sum of the normalized positions of the nucleotide in the RNA sequence. This feature is computed by equation (3) and represented as 4 elements ($R_A$, $R_C$, $R_G$, $R_U$) in a feature vector. Because of these elements, identical amino acid sequences can be encoded into different feature vectors if they interact with different RNA sequences.

$$R_b = \sum_{i \,:\, r_i = b} \text{Normalized Position}(i), \qquad b \in \{A, C, G, U\} \tag{3}$$

where $r_i$ denotes the nucleotide at position i of the RNA sequence.

Figure 2 The structure of a feature vector with a window of 9 amino acids. A window of 9 amino acids corresponds to 7 overlapping triplets. The 21 global feature elements (1 L and 20 Cs) and 4 RNA feature elements ($R_A$, $R_C$, $R_G$, $R_U$) are encoded once for a given pair of protein and RNA sequences. 9 local feature elements (N, H, A, M, P and 4 IPs) are encoded for each of the 7 internal residues, and 5 local feature elements (N, H, A, M, P) for each of the 2 terminal residues. Thus, the feature vector representing a window of 9 residues has a total of 98 (= 21 + 4 + 9×7 + 5×2) feature elements.

Each feature element is normalized into a value in the range [0, 1] when it is represented in a feature vector. The global features of a protein (1 element for L and 20 elements for C) and its partner feature (4 elements for R) are represented once for the entire protein sequence, but the local features of a protein are represented for each internal residue (5 elements for N, H, A, M, and P, and 4 elements for IP). The IP is not defined for the terminal residues of a window (e.g., $a_{i-4}$ and $a_{i+4}$ in Figure 2), so only 5 elements are represented for the terminal residues.
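The position-based features can be illustrated with a short sketch. It assumes 1-based positions divided by the sequence length for the normalized position, and it computes each RNA element as the sum of the normalized positions at which that nucleotide occurs; the function names are hypothetical, and the final scaling of each element into [0, 1] is not shown because the scaling procedure is not specified in the text.

```python
def normalized_position(position, length):
    """1-based position divided by sequence length (assumed normalization)."""
    return position / length


def rna_partner_features(rna):
    """Sum of normalized positions for each of the four nucleotide types."""
    sums = {"A": 0.0, "C": 0.0, "G": 0.0, "U": 0.0}
    for i, base in enumerate(rna, start=1):
        if base in sums:  # skip any non-standard symbols
            sums[base] += normalized_position(i, len(rna))
    return sums


# Two RNAs with the same composition but different order give different
# partner features, so the same protein fragment is encoded differently.
features_au = rna_partner_features("AU")
features_ua = rna_partner_features("UA")
```

Because the sums depend on where each nucleotide occurs and not only on how often, this encoding captures more than nucleotide composition while still using only 4 elements.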
Since we use overlapping triplets for encoding a sequence, a sliding window of w residues corresponds to w - 2 triplets. When a sliding window of w residues is used, the feature vector for residue i starts with residue i - (w - 1)/2 and covers the w - 2 overlapping triplets in the window. Thus, a sequence fragment of w residues is encoded as a feature vector of 9w + 17 elements: 21 global elements (1 L and 20 Cs), 4 RNA elements ($R_A$, $R_C$, $R_G$ and $R_U$), 9 local elements (N, H, A, M, P and 4 IPs) for each of the w - 2 internal residues, and 5 local elements (N, H, A, M and P) for each of the 2 terminal residues. A feature vector is labeled +1 (positive) if the middle residue of the sequence fragment is a binding residue, and -1 (negative) otherwise. Figure 2 shows an example of a feature vector for an amino acid sequence with a window of 9 amino acids.
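The element count and the labeling rule can be checked with a small sketch. The function names are hypothetical, and padding the sequence ends with a placeholder symbol so that every residue gets a full window is an assumption; the text does not say how windows at the sequence ends are handled.

```python
def feature_vector_length(w):
    """Number of elements for a window of w residues (equals 9w + 17)."""
    global_and_rna = 21 + 4   # 1 L, 20 Cs, and R_A, R_C, R_G, R_U
    internal = 9 * (w - 2)    # N, H, A, M, P and 4 IPs per internal residue
    terminal = 5 * 2          # N, H, A, M, P for the two terminal residues
    return global_and_rna + internal + terminal


def window_fragments(sequence, labels, w, pad="X"):
    """Yield one (fragment, label) pair per residue.

    The label of a fragment is the label of its middle residue.
    Padding with `pad` at both ends is an assumption for illustration.
    """
    half = (w - 1) // 2
    padded = pad * half + sequence + pad * half
    return [(padded[i:i + w], labels[i]) for i in range(len(sequence))]


fragments = window_fragments("MKRST", [-1, 1, 1, -1, -1], 3)
```

A protein of n residues yields n fragments, one centered on each residue, which is why the number of feature vectors before redundancy reduction equals the number of residues.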

Feature vector-based reduction of data redundancy
All of the protein sequences in the protein-RNA interacting pairs are segmented into overlapping sequence fragments of a window size w. From a protein sequence of n amino acids, n sequence fragments are generated and each sequence fragment is encoded into a feature vector. Feature vectors are considered identical only when they have the same elements and labels. When a prediction model is trained by redundant data, a bias towards the over-represented data is introduced during prediction. Thus, a training dataset should be constructed with the most representative data after removing redundant data.
We removed redundant data based on the feature vector representing the data. Figure 3 explains our method with hypothetical sequences and features. In case 1 of Figure 3, the sequence fragments s1 and s3 have the same amino acid sequence and the middle amino acids of s1 and s3 are both binding sites. Thus, the feature vectors v1 and v3, representing the sequence fragments s1 and s3, have the same vector elements and the label. To remove redundant data in a training dataset, only one sequence fragment (s1 in this example) is left and s3 is discarded. On the other hand, the sequence fragments s2 and s4 have the same vector elements but different labels, so their feature vectors v2 and v4 are not identical. Both s2 and s4 are included in the training dataset. In case 2 of Figure 3, an additional feature of the protein, sequence length, is included in a feature vector. Then, the feature vectors v5 and v6 representing the sequence fragments s5 and s6 are no longer the same. Figure 4 compares the feature vector-based redundancy reduction method with the standard redundancy reduction method, which reduces data redundancy based on the sequence similarity. The feature vector-based method constructs a non-redundant training dataset with all possible sequence fragments in the protein sequences, but the sequence similarity-based method discards some sequence fragments and constructs a smaller training dataset than the feature vector-based method. It is also noticeable that the sequence similarity-based method kept the redundant data (Fragment 2 and Fragment 4) in the training dataset, whereas the feature vector-based method did not include redundant data in the training dataset by considering the feature vectors and their labels. When a prediction model is trained by redundant data, the model is biased toward the over-represented data.
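The reduction rule described above amounts to keeping one copy of each distinct (elements, label) pair. The following is a minimal sketch with hypothetical function names and toy two-element vectors; it also counts element patterns that survive with both labels, since those are kept deliberately.

```python
def reduce_redundancy(labeled_vectors):
    """Keep the first occurrence of each distinct (elements, label) pair."""
    seen = set()
    kept = []
    for vector, label in labeled_vectors:
        key = (tuple(vector), label)  # the label is part of the identity
        if key not in seen:
            seen.add(key)
            kept.append((vector, label))
    return kept


def count_shared_patterns(kept):
    """Count element patterns that were kept with both binding labels."""
    labels_by_vector = {}
    for vector, label in kept:
        labels_by_vector.setdefault(tuple(vector), set()).add(label)
    return sum(1 for labels in labels_by_vector.values() if len(labels) > 1)


# Toy data mirroring case 1: v1 and v3 are identical, while v2 and v4 share
# elements but differ in label.
toy = [
    ((0.1, 0.2), +1),  # v1
    ((0.3, 0.4), +1),  # v2
    ((0.1, 0.2), +1),  # v3, duplicate of v1, discarded
    ((0.3, 0.4), -1),  # v4, same elements as v2 but different label, kept
]
kept = reduce_redundancy(toy)
```

Adding a further feature such as sequence length simply extends each tuple, so two fragments with the same residues but from proteins of different length no longer collide.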

Performance measures
The performance of the prediction model was evaluated using seven different measures: sensitivity (Sn), specificity (Sp), precision rate (Pr), accuracy (Ac), net prediction (NP), F-measure (Fm) and correlation coefficient (CC).

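$$Sn = \frac{TP}{TP + FN}, \qquad Sp = \frac{TN}{TN + FP}, \qquad Pr = \frac{TP}{TP + FP}$$

$$Ac = \frac{TP + TN}{TP + FP + TN + FN}, \qquad NP = \frac{Sn + Sp}{2}, \qquad Fm = \frac{2 \cdot Pr \cdot Sn}{Pr + Sn}$$

$$CC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
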
In these equations, the true positives (TP) are binding residues that are predicted as the binding residues, the true negatives (TN) are non-binding residues that are predicted as the non-binding residues, the false positives (FP) are non-binding residues that are predicted as the binding residues, and the false negatives (FN) are binding residues that are predicted as the non-binding residues.
Sensitivity is the percentage of RNA-binding amino acids that are correctly predicted as RNA-binding. Specificity is the percentage of non-binding amino acids that are correctly predicted as non-binding. Accuracy is the percentage of all amino acids that are correctly predicted. However, accuracy may be misleading in highly imbalanced datasets. For example, in a dataset of 10 positive and 90 negative samples, the accuracy is as high as 90% if all the samples are classified as negative. Net prediction is the average of sensitivity and specificity. The correlation coefficient is the best single measure for comparing the overall performance of different methods [14].
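The measures can be computed directly from the four counts. In this sketch the F-measure and the correlation coefficient are written in their common forms (the harmonic mean of precision and sensitivity, and the Matthews correlation coefficient), which is an assumption about the definitions used here; the function name and the convention of returning 0 for an empty denominator are also illustrative choices.

```python
import math


def performance_measures(tp, tn, fp, fn):
    """Return the seven measures as fractions in a dictionary."""
    def ratio(num, den):
        return num / den if den else 0.0  # convention: 0 when undefined

    sn = ratio(tp, tp + fn)                 # sensitivity
    sp = ratio(tn, tn + fp)                 # specificity
    pr = ratio(tp, tp + fp)                 # precision
    ac = ratio(tp + tn, tp + tn + fp + fn)  # accuracy
    net = (sn + sp) / 2                     # net prediction
    fm = ratio(2 * pr * sn, pr + sn)        # F-measure
    cc = ratio(
        tp * tn - fp * fn,
        math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    )
    return {"Sn": sn, "Sp": sp, "Pr": pr, "Ac": ac, "NP": net, "Fm": fm, "CC": cc}


# The imbalanced example from the text: 10 positives, 90 negatives,
# everything classified as negative.
all_negative = performance_measures(tp=0, tn=90, fp=0, fn=10)
```

In the all-negative example the accuracy is 0.9 while the sensitivity, F-measure and correlation coefficient are all 0, which shows why accuracy alone is not reported.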

Datasets of protein-RNA interactions
We constructed three different protein-RNA interaction datasets: PRI3149, PRI727 and PRI267. For the PRI3149 dataset, the protein-RNA complexes were obtained from the Protein Data Bank (PDB) [15]. As of November 2009, there were 442 protein-RNA complexes that were determined by X-ray crystallography with a resolution of 3.0Å or better. After applying the geometric criteria for H bonds to 442 protein-RNA complexes, 429 protein-RNA complexes containing 3,149 pairs of interacting protein-RNA sequences were left that satisfied the criteria. If a protein p interacted with two different RNAs r1 and r2, both pairs p -r1 and p -r2 were included in the dataset. The 3,149 protein-RNA interacting pairs were formed by 2,663 protein sequences and 812 RNA sequences. From the PRI3149 dataset, we constructed a set of non-redundant feature vectors to train the SVM model.
The PRI727 and PRI267 datasets were constructed independently from the PRI3149 dataset solely for testing different methods of predicting RNA-binding residues in the protein sequence. We obtained a total of 107 protein-RNA complexes that had been deposited in PDB since November 2009. After applying the geometric criteria for H bonds to the 107 protein-RNA complexes, 727 protein-RNA interacting pairs with 592 protein sequences and 244 RNA sequences were left to form the PRI727 dataset.

Figure 3 An example of feature vector-based reduction of data redundancy. Case 1 shows the procedure of the feature vector-based redundancy reduction method with a hypothetical amino acid-based feature. Case 2 shows the procedure when a feature of a different type is added: the sequence length feature is combined with the amino acid-based feature to encode a feature vector. In case 1, the sequence fragment s3 is not included in the training dataset because it generates a feature vector that is redundant with the one from the sequence fragment s1. In case 2, however, the sequence fragments s5 and s6 yield non-identical feature vectors (v5, v6) when the sequence length feature is used, and both sequence fragments are included in the training dataset.

For a more rigorous evaluation, any pair of protein and RNA sequences in the PRI727 dataset with >60% sequence identity to the sequences in the PRI3149 was removed. As a result, 267 protein-RNA interacting pairs with 192 protein sequences and 211 RNA sequences were left to form the PRI267 dataset. Details of the datasets are available as Additional Files 1, 2, 3.

Feature vector-based reduction of data redundancy
The PRI3149 dataset of 3,149 protein-RNA interacting pairs initially contains 59,398 RNA-binding residues and 542,627 non-binding residues. If redundant data are not removed, the number of positive sequence fragments is the same as the number of binding residues and the number of negative sequence fragments is the same as the number of non-binding residues. We represented the 3,149 protein-RNA interacting pairs as feature vectors using two different combinations of their features (all protein and RNA features vs. local features of the protein only) and applied the feature vector-based redundancy reduction method to the feature vectors. Table 1 shows the number of remaining feature vectors after applying the feature vector-based redundancy reduction method to the PRI3149 dataset. Common vectors in Table 1 denote the feature vectors with the same vector elements but different binding labels ('+1' for binding and '-1' for non-binding) (Figure 4). It is harder to separate the classes in data with more common feature vectors than in data with fewer common feature vectors. As shown in Table 1, using all 9 features (protein sequence length, amino acid composition, normalized position, hydropathy, accessible surface area, molecular mass, and side chain pKa of an amino acid, IP of an amino acid triplet, and sum of the normalized positions of each nucleotide type) consistently produced more feature vectors but a smaller proportion of common feature vectors than using the 6 local features of the protein (normalized position, hydropathy, accessible surface area, molecular mass, and side chain pKa of an amino acid, and IP of an amino acid triplet) at all window sizes. When the 6 local features of sequence fragments were represented, the feature vector-based redundancy reduction method with a larger window size constructed a larger non-redundant dataset.
However, when the 9 features were represented, the feature vector-based redundancy reduction method constructed non-redundant datasets of similar size irrespective of the window size.

Figure 4 Comparison of the sequence similarity-based method and the feature vector-based method for reducing data redundancy. The sequence similarity-based method removes an entire sequence that is identical or similar to other sequences. When similar sequences are eliminated from a dataset, their binding information is also lost. When the remaining sequence contains repetitive subsequences, redundant data are generated from the subsequences. The feature vector-based method first represents every possible subsequence and its binding information as a feature vector. A subsequence is removed only when it has the same feature vector as others. Subsequences with the same amino acid sequence but different binding information are considered different and both are kept in the training dataset.

Table 2 compares the performance of the feature vector-based redundancy reduction method with that of the sequence similarity-based redundancy reduction method on the PRI727 and PRI267 datasets. S-method is the sequence similarity-based redundancy reduction using the CD-HIT program. The number in parentheses represents the sequence identity threshold of the CD-HIT clusters, and the longest sequence of each cluster was included in the training dataset. F-method is the feature vector-based redundancy reduction. The only difference between S-method and F-method was in their training datasets. The SVM model was trained and tested using 9 features and a window size of 15.
In both the PRI727 and PRI267 datasets, F-method was better than S-method in all performance measures.

Interaction propensity of amino acid triplets with RNA
By equation (1), we computed 32,000 interaction propensities between the 8,000 amino acid triplets and 4 nucleotides with the PRI3149 dataset. The IP values range from 0.0 to 4.789796, and 18,301 IPs (57.2%) out of the total 32,000 IPs have non-zero values. The pair (SHK, U) had the highest IP value (4.79), and CRR showed a high average IP value (2.46) across all nucleotides. In contrast, 2,238 IPs had zero values with all nucleotides.

Table 1 The number of non-redundant feature vectors generated from the PRI3149 dataset by the feature vector-based redundancy reduction method with various window sizes.

Table 2 S-method is the sequence similarity-based redundancy reduction using the CD-HIT program. The number in parentheses indicates the sequence identity threshold of the CD-HIT clusters. F-method is the feature vector-based redundancy reduction. The SVM model was trained and tested using 9 features and a window size of 15. NP: net prediction. Fm: F-measure. CC: correlation coefficient.

In addition to the IP of amino acid triplets, we computed the four RNA feature elements ($R_A$, $R_C$, $R_G$, $R_U$) for the RNA sequences in the PRI3149 dataset using equation (3). The PRI3149 dataset contains 812 RNA sequences, of which only 312 are distinct from each other. When we represented the four RNA features for the 312 sequences, each sequence yielded a unique feature vector. The interaction propensities of amino acid triplets and the RNA feature elements computed for the PRI3149 dataset are available as Additional Files 4 and 5.
To examine the effect of different definitions of the interaction propensity of amino acids with RNA on prediction performance, we encoded the non-redundant dataset using 3 different definitions of IP: the interaction propensity sIP of single amino acids [5,6], the interaction propensity prev_tIP of amino acid triplets used in our previous study [7], and the interaction propensity tIP of amino acid triplets used in this study. The results shown in Table 3 were obtained by 5-fold cross validation with a window size of 15. The SVM models with the IP of amino acid triplets (i.e., prev_tIP and tIP) were better than those with the IP of single amino acids (sIP). As a single feature, the new IP of amino acid triplets (tIP) showed the best performance. When the IP was used along with the RNA feature elements ($R_A$, $R_C$, $R_G$, $R_U$), performance always improved compared to prediction with the IP only.

Implementation and prediction results
We implemented a set of programs for extracting H bonds with various geometric criteria from PDB files, for encoding a feature vector for the data, and for running an SVM model under several different conditions. A program called HBPLUS (http://www.biochem.ucl.ac.uk/bsm/hbplus/home.html) is widely used to extract H bonds from PDB files, but it cannot deal with new PDB files that have atom names such as O2′ and OP1 in RNA. In our program for extracting H bonds, atoms O2′ and O3′ of RNA were included as potential donors of H bonds. Likewise, atoms OP1, OP2, O2′, O3′, O4′ and O5′ of RNA were included as potential acceptors of H bonds.
To evaluate the effect of the window size on predicting RNA-binding amino acids, several datasets were constructed by applying the feature vector-based redundancy reduction method with various window sizes to the PRI3149 dataset. Table 4 shows the prediction performance of 5-fold cross validation for the SVM models trained on 8 different non-redundant datasets built from the PRI3149 dataset (see Table 1). All the results shown in Table 4 were obtained with the following parameter values: C = 10, γ = 1/(number of feature elements in the dataset), w+ (weight of the positive class) = (number of negative feature vectors)/(number of positive feature vectors), and w- (weight of the negative class) = 1. The prediction performance improved in all performance measures as the window size increased up to 15. The best prediction performance of 5-fold cross validation (an accuracy of 84.2%, an F-measure of 76.1%, and a correlation coefficient of 0.41) was therefore achieved by the SVM model with a window size of 15.
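The parameter choices above depend only on the size and class balance of the training dataset, so they can be derived with a few lines of code. This is a minimal sketch: the function name and the dictionary keys are hypothetical and chosen to resemble the argument names of common SVM libraries, and the counts in the example are invented for illustration.

```python
def svm_parameters(n_features, n_positive, n_negative, cost=10.0):
    """Derive C, gamma and class weights as described in the text."""
    return {
        "C": cost,                    # C = 10
        "gamma": 1.0 / n_features,    # 1 / number of feature elements
        "class_weight": {
            +1: n_negative / n_positive,  # w+ compensates for class imbalance
            -1: 1.0,                      # w- is fixed at 1
        },
    }


# A window of 15 residues gives 9 * 15 + 17 = 152 feature elements.
# The vector counts below are hypothetical.
params = svm_parameters(n_features=152, n_positive=100, n_negative=900)
```

Weighting the positive class by the negative-to-positive ratio makes an error on a binding residue cost as much in total as errors on the far more numerous non-binding residues.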

Comparison with other methods
We compared our method to other machine learning methods for predicting RNA-binding amino acids in a protein sequence. BindN [2] uses a support vector machine with different amino acid features. RNABindR [3,4] predicts RNA-binding residues in a protein sequence using a Naïve Bayes classifier. These methods use different features to encode a feature vector and different machine learning algorithms to build a prediction model, but they performed the sequence similarity-based redundancy reduction method for their training datasets and did not consider the information of interacting partner sequences in predicting RNA-binding residues.
To compare these methods objectively, we tested them on the PRI727 and PRI267 datasets separately. Both datasets were different from the PRI3149 dataset, which was used to train our SVM model. Table 5 shows the prediction performance of the methods with various options. RNABindR was run with three options: high sensitivity (sn), high specificity (sp) and optimal (opt). BindN was executed with two options: expected sensitivity of 80% (sn80) and expected specificity of 80% (sp80). To examine the effect of using the RNA feature on the prediction performance, we built two SVM models that used different features. 'Our method 1' in Table 5 used the RNA features as well as the protein features, and 'Our method 2' used the protein features only. In both the PRI727 and the PRI267 datasets, our SVM model that used the RNA features as well as the protein features (Our method 1) achieved high values for both sensitivity and specificity, whereas the other methods achieved either high sensitivity or high specificity. In addition, Our method 1 had higher values for the net prediction, F-measure and correlation coefficient than the other methods, including our SVM model that used protein features only (Our method 2). Our method 2 achieved similar or better prediction performance than the existing methods. This result shows that the feature vector-based method and the features are useful for constructing a highly accurate model for predicting RNA-binding residues. Details of the prediction results are available in Additional Files 6 and 7. Figure 5 shows an example of prediction by the SVM model (Our method 1 in Table 5) for protein chain B with RNA chain D in a protein-RNA complex (PDB ID: 3OVB).

Conclusions
Most learning approaches to predicting RNA-binding residues in a protein sequence construct a training dataset based on sequence similarity. During the process of removing redundancy in sequence data, a whole sequence is either taken or discarded for the training dataset. Similar or even identical sequences often have very different binding sites when their binding partners change. As a result, much binding information is lost when a training dataset is constructed by the sequence similarity-based redundancy reduction method.
We developed a feature vector-based method for removing data redundancy. Our method constructed a larger training dataset of non-redundant data than the standard sequence similarity-based reduction method. Furthermore, the training dataset constructed by the feature vector-based method did not contain redundant data, whereas the dataset built by the sequence similarity-based method was likely to contain redundant data when a single sequence contained similar subsequences. Previous approaches to predicting RNA-binding residues in a protein sequence do not consider the interacting partner (i.e., RNA) of a protein. As a result, they always predict the same RNA-binding sites for a given protein sequence even if the protein binds to different RNA molecules. We took both protein and RNA sequences as input and used the RNA feature to predict RNA-binding residues in the protein sequence. Our SVM model showed an accuracy of 84.2%, an F-measure of 76.1%, and a correlation coefficient of 0.41 with 5-fold cross validation on the 3,149 protein-RNA interacting pairs.
For a more rigorous evaluation we tested our SVM model on two new datasets, which were not used in training the model. In the new datasets, our SVM model showed an accuracy of 90.3%, an F-measure of 72.8% and a correlation coefficient of 0.24. Comparison of our SVM model with other methods on the same datasets demonstrated that ours is better than other methods.
In this study we identified effective features of protein and RNA (i.e., the interaction propensity of amino acid triplets and RNA features) for predicting RNA-binding residues and developed a new data redundancy reduction method for constructing a training dataset. These will be useful in other studies of protein-RNA interactions.

Figure 5 Prediction of binding sites in protein chain B with RNA chain D of 3OVB. 21 binding amino acids (blue balls, TP) and 408 non-binding amino acids (orange balls and sticks, TN) were predicted correctly. 12 non-binding amino acids were incorrectly predicted as binding (yellow balls, FP), and no binding amino acids were incorrectly predicted as non-binding (no FN). In the RNA, protein-binding nucleotides are shown as dark gray balls and sticks, and non-binding nucleotides as gray wireframes. The '+' symbol in the text line below the sequence represents a binding amino acid, while the '-' symbol represents a non-binding amino acid. Due to limited space, the last 295 amino acids of protein chain B are not shown in the sequences. TP: true positives. TN: true negatives. FP: false positives. FN: false negatives.