Prediction of RNA-binding amino acids from protein and RNA sequences
© Choi and Han; licensee BioMed Central Ltd. 2011
Published: 30 November 2011
Many learning approaches to predicting RNA-binding residues in a protein sequence construct a non-redundant training dataset based on the sequence similarity. The sequence similarity-based method either takes a whole sequence or discards it for a training dataset. However, similar sequences or even identical sequences can have different interaction sites depending on their interaction partners, and this information is lost when the sequences are removed. Furthermore, a training dataset constructed by the sequence similarity-based method may contain redundant data when the remaining sequence contains similar subsequences within the sequence. In addition to the problem with the training dataset, most approaches do not consider the interacting partner (i.e., RNA) of a protein when they predict RNA-binding amino acids. Thus, they always predict the same RNA-binding sites for a given protein sequence even if the protein binds to different RNA molecules.
We developed a feature vector-based method that removes data redundancy for a non-redundant training dataset. The feature vector-based method constructed a larger training dataset than the standard sequence similarity-based method, yet the dataset contained no redundant data. We identified effective features of protein and RNA (the interaction propensity of amino acid triplets, global features of the protein sequence, and RNA features) for predicting RNA-binding residues. Using the method and features, we built a support vector machine (SVM) model that predicted RNA-binding residues in a protein sequence. Our SVM model showed an accuracy of 84.2%, an F-measure of 76.1%, and a correlation coefficient of 0.41 with 5-fold cross validation on a non-redundant dataset from 3,149 protein-RNA interacting pairs. On an independent test dataset that did not include the 3,149 pairs and was not used in training the SVM model, it achieved an accuracy of 90.3%, an F-measure of 72.8%, and a correlation coefficient of 0.24. Comparison with other methods on the same datasets demonstrated that our model performed better than the others.
The feature vector-based redundancy reduction method is powerful for constructing a non-redundant training dataset for a learning model since it generates a larger dataset with non-redundant data than the standard sequence similarity-based method. Including the features of both RNA and protein sequences in a feature vector results in better performance than using the protein features only when predicting the RNA-binding residues in a protein sequence.
Interactions between proteins and RNA are fundamental to many cellular processes. Much experimental and theoretical effort has been made to study protein-RNA interactions, but their precise mechanism is not fully understood. Motivated by the recent increase in structures of protein-RNA complexes, several theoretical studies, including supervised learning approaches, have been carried out to predict RNA-binding residues in protein sequences. For example, BindN uses a support vector machine (SVM) to predict the RNA- or DNA-binding residues in a protein sequence based on the chemical properties of amino acids. RNABindR [3, 4] predicts the RNA-binding residues in a protein sequence using a Naïve Bayes classifier. However, none of these consider interacting partners (i.e., RNA) for a given protein when predicting RNA-binding amino acids. Thus, they always predict the same RNA-binding sites for a given protein sequence even if the protein binds to different RNA molecules.
We previously studied the interactions between protein and RNA [5, 6]. In an effort to discover binding-specific features of amino acids and nucleotides, we performed an extensive analysis of the recent structures of protein-RNA complexes and computed several types of interaction propensity (IP) between amino acids and nucleotides. Our analysis revealed that the IP of amino acid triplets has a higher binding specificity than the IP of individual amino acids or other biochemical properties. In this study, we modified the previous interaction propensity of amino acid triplets and computed the new interaction propensity from more structures of protein-RNA complexes. In addition to the interaction propensity, we identified several features of protein and RNA sequences which are effective for predicting RNA-binding amino acids in a protein sequence.
In the standard redundancy reduction practice for sequence data, a sequence is either included in a training dataset as a whole or discarded from it. It is an all-or-nothing approach, so no partial sequence of the original sequence can be included in a training dataset. As a solution, we proposed a feature vector-based reduction of data redundancy. In the feature vector-based approach, two identical feature vectors with different binding labels (i.e., one is binding and the other is non-binding) are considered as different feature vectors and both are included in a training dataset. The feature vector-based redundancy reduction method constructs a powerful training dataset, especially when there is not sufficient data available.
This paper describes a support vector machine (SVM) model that predicts RNA-binding residues by considering both RNA and protein sequences. The SVM model was trained on a non-redundant dataset constructed by the feature vector-based redundancy reduction method. To the best of our knowledge, this is the first attempt to predict RNA-binding amino acids by considering the RNA sequence that interacts with the protein. The rest of the paper presents the details of the SVM model and its experimental results.
Definition of protein-RNA binding sites
Different studies have used slightly different criteria for protein-RNA binding sites. For example, in BindN and RNABindR [3, 4] an amino acid with an atom within a distance of 5 Å from any other atom of a nucleotide was considered to be an RNA-binding amino acid. We use stricter criteria. In our study, an RNA-binding amino acid should satisfy the following geometric criteria for hydrogen bonding (H bond) interaction with RNA: a donor-acceptor (D-A) distance < 3.9 Å, a hydrogen-acceptor (H-A) distance < 2.5 Å, a donor-hydrogen-acceptor (D-H-A) angle > 90°, and a hydrogen-acceptor-acceptor antecedent (H-A-AA) angle > 90°.
These criteria are slightly different from the ones used in our previous studies [6, 12] but they are the most commonly used criteria for H bonds. In particular, the criteria of an H-A distance < 2.5 Å and a D-H-A angle > 90° are essential for H bonds.
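The four geometric criteria can be read as a single conjunctive test. The following is a minimal sketch, not the authors' program: the function name `is_hbond` and its argument names are hypothetical, and the distances and angles are assumed to have been measured already (in Å and degrees).

```python
def is_hbond(d_a_dist, h_a_dist, dha_angle, haaa_angle):
    """Return True when a contact meets all four criteria stated in the text.

    d_a_dist   -- donor-acceptor (D-A) distance in angstroms
    h_a_dist   -- hydrogen-acceptor (H-A) distance in angstroms
    dha_angle  -- donor-hydrogen-acceptor (D-H-A) angle in degrees
    haaa_angle -- hydrogen-acceptor-acceptor antecedent (H-A-AA) angle in degrees
    """
    return (
        d_a_dist < 3.9
        and h_a_dist < 2.5
        and dha_angle > 90.0
        and haaa_angle > 90.0
    )
```

Under this reading, a residue would be labeled RNA-binding when at least one of its contacts with a nucleotide passes the test.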
Interaction propensity of amino acid triplets
This definition of IP_tb in equation (1) differs slightly from the definition we used in our previous studies [5–7]: (1) IP_tb uses the inverse of the H-A distance multiplied by the cosine of the D-A-H angle of the H bonds between an amino acid in a triplet and a nucleotide, instead of using the H bonds between an amino acid and a nucleotide, and (2) IP_tb uses amino acid triplets instead of individual amino acids.
In equation (1), the summation term is the sum of the inverse projected distance of H-A on D-A between triplet t and the binding nucleotide b, N_PR is the total number of amino acid triplets that bind to any nucleotide, N_t is the number of triplets t, N_P is the total number of amino acid triplets, N_b is the number of nucleotides b, and N_R is the total number of nucleotides in the dataset. The projected distance of H-A on D-A is used so that both the H-A distance and its position relative to D-A are taken into account. The inverse is used so that H bonds with close donor-acceptor pairs receive higher IP values than those with distant donor-acceptor pairs. Since there are 20^3 = 8,000 amino acid triplets and 4 nucleotides, we computed 32,000 IPs between amino acid triplets and nucleotides.
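The display of equation (1) is not reproduced in this text. One form consistent with the symbol definitions given here is sketched as follows; it is an assumption, not necessarily the authors' exact notation. The symbols S_tb (the sum), d_HA (the H-A distance), θ_DAH (the D-A-H angle), and the index set H_tb (the H bonds between triplet t and nucleotide b) are introduced only for this illustration.

```latex
% Assumed reconstruction of equation (1): an observed-over-expected ratio.
\[
  IP_{tb} \;=\; \frac{S_{tb} / N_{PR}}{\left(N_t / N_P\right)\left(N_b / N_R\right)},
  \qquad
  S_{tb} \;=\; \sum_{h \in H_{tb}} \frac{1}{d_{HA}(h)\,\cos\theta_{DAH}(h)}
\]
```

Read this way, the numerator is the distance-weighted share of binding events involving triplet t and nucleotide b, and the denominator is the share expected from the background frequencies of t and b alone.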
Encoding a feature vector
To predict RNA-binding amino acids in the protein sequence, we represented several features of protein and RNA sequences in a feature vector. The features can be categorized into three types: (1) global features of the protein sequence, (2) local features of amino acids, and (3) partner features. The global features describe the entire sequence that contains the target residue, whereas the local features describe the individual residue. Since our prediction model is designed to predict different binding sites in a protein sequence depending on the partner sequence, the information of the interacting partner sequence was used with the other features.
Global features of the protein sequence included the sequence length (L) and amino acid composition (C). The amino acid composition represented the frequencies of 20 amino acids in a protein sequence. The global features required a total of 21 elements in a feature vector, one for the sequence length and 20 for the amino acid composition.
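The 21 global elements can be illustrated with a short sketch. It assumes the composition is the relative frequency of each of the 20 standard amino acids; the function name `global_features` and the ordering of amino acids are hypothetical, and the sequence length is returned unscaled (the normalization into [0, 1] is left out because the scaling used for L is not stated).

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues, alphabetical one-letter codes

def global_features(protein_seq):
    """Return [L, C_A, C_C, ..., C_Y]: sequence length plus 20 composition values."""
    length = len(protein_seq)
    if length == 0:
        raise ValueError("protein sequence must not be empty")
    composition = [protein_seq.count(aa) / length for aa in AMINO_ACIDS]
    return [length] + composition
```

For example, `global_features("AACD")` yields a length of 4 followed by frequencies 0.5 for A, 0.25 for C, 0.25 for D, and 0 for the remaining residues.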
Local features of amino acids included the normalized position (N), hydropathy (H), accessible surface area (A), molecular mass (M), and side chain pKa (P) value of an amino acid, and the interaction propensity (IP) of an amino acid triplet. IP is represented as 4 elements, IP_A, IP_C, IP_G, and IP_U, in which IP_A denotes the interaction propensity of the amino acid triplet with the nucleotide adenine (A) (Figure 2). The normalized position of an amino acid in the sequence is calculated by equation (2). Except for the normalized position, the same amino acid or amino acid triplet always has the same values for the local features.
Partner features represent the feature of the RNA (R) sequence that interacts with the protein. For each of the four nucleotides, we encoded the sum of the normalized positions of the nucleotide in the RNA sequence. This feature is computed by equation (3) and represented as 4 elements (R_A, R_C, R_G, R_U) in a feature vector. Due to these elements, identical amino acid sequences can be encoded into different feature vectors if they interact with different RNA sequences.
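The RNA partner feature can be sketched as follows. Equation (3) is not reproduced in this text, so the sketch assumes that the normalized position of the nucleotide at 1-based index i in an RNA of length n is i/n; that choice, the function name `rna_partner_features`, and the omission of the final scaling into [0, 1] are all assumptions made for illustration.

```python
def rna_partner_features(rna_seq):
    """Return [R_A, R_C, R_G, R_U]: per-nucleotide sums of normalized positions.

    Assumes normalized position = i / n with 1-based index i (not confirmed
    by the text). Characters other than A, C, G, U are ignored.
    """
    n = len(rna_seq)
    sums = {nt: 0.0 for nt in "ACGU"}
    for i, nt in enumerate(rna_seq, start=1):
        if nt in sums:
            sums[nt] += i / n
    return [sums[nt] for nt in "ACGU"]
```

Because positions enter the sum, two RNAs with the same composition but a different order, such as "AU" and "UA", produce different elements, which is what lets one protein sequence map to different feature vectors for different partners.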
Each of the feature elements is normalized into a value in the range of [0, 1] when it is represented in a feature vector. The global features of a protein (1 element for L and 20 elements for C) and its partner feature (4 elements for R) are represented once for the entire protein sequence, but the local features of a protein should be represented for each internal residue (5 elements for N, H, A, M, and P and 4 elements for IP). The IP is not defined for the terminal residues of a window (e.g., a_(i-4) and a_(i+4) in Figure 2), so only 5 elements are represented for the terminal residues.
Since we use overlapping triplets for encoding a sequence, a sliding window of w residues corresponds to w – 2 triplets. When a sliding window of w residues is used, the feature vector for residue i starts with residue i – (w – 1)/2 and covers the triplets T(i – (w – 1)/2 – 1), T(i – (w – 3)/2 – 1), …, T(i + (w – 3)/2 – 1) and T(i + (w – 1)/2 – 1). Thus, a sequence fragment of w residues is encoded as a feature vector of 9w + 17 elements: 21 global elements (1 L and 20 Cs), 4 RNA elements (R_A, R_C, R_G and R_U), 9 local elements (N, H, A, M, P and 4 IPs) for each of the w – 2 internal residues, and 5 local elements (N, H, A, M and P) for each of the 2 terminal residues. A feature vector is labeled +1 (positive) if the middle residue of the sequence fragment is a binding residue, and -1 (negative) otherwise. Figure 2 shows an example of a feature vector for an amino acid sequence with a window of 9 amino acids.
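The element count can be checked with a few lines of arithmetic. This is a sketch for verification only; the function names `feature_vector_length` and `window_triplets` are hypothetical, and an odd window size is assumed so that the fragment has a middle residue.

```python
def feature_vector_length(w):
    """Number of elements in the feature vector for a window of w residues."""
    if w < 3 or w % 2 == 0:
        raise ValueError("window size must be odd and at least 3")
    global_elems = 21              # 1 for L, 20 for C
    rna_elems = 4                  # R_A, R_C, R_G, R_U
    internal_elems = 9 * (w - 2)   # N, H, A, M, P and 4 IPs per internal residue
    terminal_elems = 5 * 2         # N, H, A, M, P for each of the 2 terminal residues
    return global_elems + rna_elems + internal_elems + terminal_elems

def window_triplets(window):
    """Overlapping triplets in a window; a window of w residues gives w - 2 triplets."""
    return [window[k:k + 3] for k in range(len(window) - 2)]
```

For the window of 9 residues used in Figure 2, `feature_vector_length(9)` gives 98, which agrees with 9w + 17, and `window_triplets` returns 7 triplets.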
Feature vector-based reduction of data redundancy
All of the protein sequences in the protein-RNA interacting pairs are segmented into overlapping sequence fragments of a window size w. From a protein sequence of n amino acids, n sequence fragments are generated and each sequence fragment is encoded into a feature vector. Feature vectors are considered identical only when they have the same elements and labels. When a prediction model is trained by redundant data, a bias towards the over-represented data is introduced during prediction. Thus, a training dataset should be constructed with the most representative data after removing redundant data.
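The redundancy rule stated above (two feature vectors are identical only when both their elements and their labels match) can be sketched as a set-based filter. This is a minimal illustration, not the authors' implementation; the name `remove_redundant` is hypothetical, and exact equality of the element values is assumed as the comparison.

```python
def remove_redundant(labeled_vectors):
    """Keep the first occurrence of each (feature vector, label) combination.

    labeled_vectors -- iterable of (vector, label) pairs, with label +1 or -1.
    Vectors with identical elements but different labels are both kept.
    """
    seen = set()
    unique = []
    for vector, label in labeled_vectors:
        key = (tuple(vector), label)
        if key not in seen:
            seen.add(key)
            unique.append((vector, label))
    return unique
```

For instance, three copies of the same vector, two labeled +1 and one labeled -1, reduce to two entries: one positive and one negative.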
In these equations, the true positives (TP) are binding residues that are predicted as the binding residues, the true negatives (TN) are non-binding residues that are predicted as the non-binding residues, the false positives (FP) are non-binding residues that are predicted as the binding residues, and the false negatives (FN) are binding residues that are predicted as the non-binding residues.
Sensitivity is the percentage of amino acids that are RNA-binding and are correctly predicted as RNA-binding. Specificity is the percentage of amino acids that are not RNA-binding and are correctly predicted as non-binding. Accuracy is the percentage of amino acids that are correctly predicted. However, accuracy may be misleading in highly imbalanced datasets. For example, in a dataset of 10 positive and 90 negative samples, the accuracy becomes as high as 90% if all the samples are classified as negative. Net prediction is the average of sensitivity and specificity. The correlation coefficient is the best single measure for comparing the overall performance of different methods.
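The measures described above can be computed from the four counts as follows. The equations themselves are not reproduced in this text, so this sketch uses the standard definitions and assumes that the correlation coefficient is the Matthews correlation coefficient; the function name `evaluate` and the convention of returning 0 for an undefined ratio are choices made for this illustration.

```python
import math

def evaluate(tp, tn, fp, fn):
    """Return sensitivity, specificity, accuracy, net prediction and correlation."""
    def ratio(num, den):
        # An undefined ratio (zero denominator) is reported as 0.0 by convention here.
        return num / den if den else 0.0

    sensitivity = ratio(tp, tp + fn)
    specificity = ratio(tn, tn + fp)
    accuracy = ratio(tp + tn, tp + tn + fp + fn)
    net_prediction = (sensitivity + specificity) / 2
    # Matthews correlation coefficient (assumed form of the correlation coefficient).
    cc = ratio(tp * tn - fp * fn,
               math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "net_prediction": net_prediction,
        "correlation": cc,
    }
```

Applied to the imbalanced example in the text (10 positives, 90 negatives, everything predicted negative), `evaluate(0, 90, 0, 10)` reports an accuracy of 0.9 but a sensitivity of 0, a net prediction of 0.5, and a correlation of 0, which shows why accuracy alone is not sufficient.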
Results and discussion
Datasets of protein-RNA interactions
We constructed three different protein-RNA interaction datasets: PRI3149, PRI727 and PRI267. For the PRI3149 dataset, the protein-RNA complexes were obtained from the Protein Data Bank (PDB). As of November 2009, there were 442 protein-RNA complexes that were determined by X-ray crystallography with a resolution of 3.0 Å or better. After applying the geometric criteria for H bonds to the 442 protein-RNA complexes, 429 protein-RNA complexes containing 3,149 pairs of interacting protein-RNA sequences were left that satisfied the criteria. If a protein p interacted with two different RNAs r_1 and r_2, both pairs p–r_1 and p–r_2 were included in the dataset. The 3,149 protein-RNA interacting pairs were formed by 2,663 protein sequences and 812 RNA sequences. From the PRI3149 dataset, we constructed a set of non-redundant feature vectors to train the SVM model.
The PRI727 and PRI267 datasets were constructed independently from the PRI3149 dataset solely for testing different methods of predicting RNA-binding residues in the protein sequence. We obtained a total of 107 protein-RNA complexes that had been deposited in PDB since November 2009. After applying the geometric criteria for H bonds to the 107 protein-RNA complexes, 727 protein-RNA interacting pairs with 592 protein sequences and 244 RNA sequences were left to form the PRI727 dataset.
For a more rigorous evaluation, any pair of protein and RNA sequences in the PRI727 dataset with >60% sequence identity to the sequences in the PRI3149 was removed. As a result, 267 protein-RNA interacting pairs with 192 protein sequences and 211 RNA sequences were left to form the PRI267 dataset. Details of the datasets are available as Additional Files 1, 2, 3.
Feature vector-based reduction of data redundancy
The PRI3149 dataset of 3,149 protein-RNA interacting pairs initially contains 59,398 RNA-binding residues and 542,627 non-binding residues. If redundant data is not removed, the number of positive sequence fragments is the same as that of binding residues and the number of negative sequence fragments is the same as that of non-binding residues. We represented the 3,149 protein-RNA interacting pairs as feature vectors using two different combinations (all protein features and RNA features vs. local features of protein) of their features and applied the feature vector-based redundancy reduction method to the feature vectors.
Feature vectors generated by applying the feature vector-based redundancy reduction method to the PRI3149 dataset (table columns: #positive feature vectors, #negative feature vectors; table rows: with 9 features (L, C, N, H, A, M, pKa, IP, and R), with 6 features (N, H, A, M, pKa, and IP)).
Comparison of the redundancy reduction methods for training datasets.
Interaction propensity of amino acid triplets with RNA
By equation (1), we computed 32,000 interaction propensities between the 8,000 amino acid triplets and 4 nucleotides with the PRI3149 dataset. The IP values range from 0.0 to 4.789796, and 18,301 (57.2%) of the 32,000 IPs are non-zero. The pair (SHK, U) had the highest IP value (4.79), and the triplet CRR showed a high IP value (2.46) on average over all nucleotides. In contrast, 2,238 IPs had zero values with all nucleotides.
In addition to the IP of amino acid triplets, we computed the four RNA feature elements (R_A, R_C, R_G, R_U) for the RNA sequences in the PRI3149 dataset using equation (3). The PRI3149 dataset contains 812 RNA sequences, of which only 312 are distinct. When we represented the four RNA features for the 312 sequences, they became unique feature vectors. The interaction propensities of amino acid triplets and the RNA feature elements computed for the PRI3149 dataset are available as Additional Files 4 and 5.
The effect of IP and RNA features on prediction performance (table rows: sIP + R, prev_tIP + R, tIP + R).
Implementation and prediction results
We implemented a set of programs for extracting H bonds with various geometric criteria from PDB files, for encoding a feature vector for the data, and for running an SVM model under several different conditions. A program called HBPLUS (http://www.biochem.ucl.ac.uk/bsm/hbplus/home.html) is widely used to extract H bonds from PDB files, but it cannot deal with new PDB files that have atom names such as O2′ and OP1 in RNA. In our program for extracting H bonds, atoms O2′ and O3′ of RNA were included as potential donors of H bonds. Likewise, atoms OP1, OP2, O2′, O3′, O4′ and O5′ of RNA were included as potential acceptors of H bonds.
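The donor and acceptor assignments for the RNA backbone and sugar atoms named above can be captured in two lookup sets. This is a sketch only: it covers just the atoms listed in the text (base atoms are not included), it assumes atom names are written with an ASCII apostrophe as in current PDB files, and the names `RNA_EXTRA_DONORS`, `RNA_EXTRA_ACCEPTORS` and `rna_atom_roles` are hypothetical.

```python
# Atoms of the RNA sugar-phosphate backbone named in the text as potential
# H-bond donors and acceptors. Base atoms are deliberately not listed here.
RNA_EXTRA_DONORS = {"O2'", "O3'"}
RNA_EXTRA_ACCEPTORS = {"OP1", "OP2", "O2'", "O3'", "O4'", "O5'"}

def rna_atom_roles(atom_name):
    """Return the set of H-bond roles ('donor', 'acceptor') for an RNA atom name."""
    roles = set()
    if atom_name in RNA_EXTRA_DONORS:
        roles.add("donor")
    if atom_name in RNA_EXTRA_ACCEPTORS:
        roles.add("acceptor")
    return roles
```

With these sets, O2' is treated as both a potential donor and a potential acceptor, while OP1 can only accept.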
The prediction performance with different window sizes.
Comparison with other methods
We compared our method to other machine learning methods for predicting RNA-binding amino acids in a protein sequence. BindN uses a support vector machine with different amino acid features. RNABindR [3, 4] predicts RNA-binding residues in a protein sequence using a Naïve Bayes classifier. These methods use different features to encode a feature vector and different machine learning algorithms to build a prediction model, but they applied the sequence similarity-based redundancy reduction method to their training datasets and did not consider the information of interacting partner sequences in predicting RNA-binding residues.
Comparison of prediction methods on the PRI727 and PRI267 datasets (table rows for each dataset include Our method 1 and Our method 2).
Most learning approaches to predicting RNA-binding residues in a protein sequence construct a training dataset based on sequence similarity. During the process of removing redundancy in sequence data, a whole sequence is either taken or discarded for the training dataset. Similar sequences or even identical sequences often have very different binding sites when their binding partners change. Consequently, much binding information is lost when a training dataset is constructed by the sequence similarity-based redundancy reduction method.
We developed a feature vector-based method for removing data redundancy. Our method constructed a larger training dataset of non-redundant data than the standard sequence similarity-based reduction method. Furthermore, the training dataset constructed by the feature vector-based method did not contain redundant data, whereas the dataset built by the sequence similarity-based method was likely to produce redundant data when a single sequence contains similar subsequences within the sequence.
Previous approaches to predicting RNA-binding residues in a protein sequence do not consider the interacting partner (i.e., RNA) of a protein. As a result, they always predict the same RNA-binding sites for a given protein sequence even if the protein binds to different RNA molecules. We took both protein and RNA sequences as input, and considered RNA feature to predict RNA-binding residues in the protein sequence. Our SVM model showed an accuracy of 84.2%, an F-measure of 76.1%, and a correlation coefficient of 0.41 with 5-fold cross validation on the 3,149 protein-RNA interacting pairs.
For a more rigorous evaluation we tested our SVM model on two new datasets, which were not used in training the model. In the new datasets, our SVM model showed an accuracy of 90.3%, an F-measure of 72.8% and a correlation coefficient of 0.24. Comparison of our SVM model with other methods on the same datasets demonstrated that ours is better than other methods.
In this study we identified effective features of protein and RNA (i.e., the interaction propensity of amino acid triplets and RNA features) for predicting RNA-binding residues and developed a new data redundancy reduction method for constructing a training dataset. These will be useful in other studies of protein-RNA interactions.
This work was supported by the Mid-career Researcher Program (2010-0000130) and in part by the Key Research Institute Program (2011-0018394) through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology.
This article has been published as part of BMC Bioinformatics Volume 12 Supplement 13, 2011: Tenth International Conference on Bioinformatics – First ISCB Asia Joint Conference 2011 (InCoB/ISCB-Asia 2011): Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/12?issue=S13.
- Draper DE: Protein-RNA recognition. Annu. Rev. Biochem. 1995, 64: 593-620. 10.1146/annurev.bi.64.070195.003113.
- Wang L, Brown SJ: BindN: a web-based tool for efficient prediction of DNA and RNA binding sites in amino acid sequences. Nucleic Acids Res. 2006, 34: W243-W248. 10.1093/nar/gkl298.
- Terribilini M, Lee JH, Yan C, Jernigan RL, Honavar V, Dobbs D: Prediction of RNA binding sites in proteins from amino acid sequence. RNA. 2006, 12: 1450-1462. 10.1261/rna.2197306.
- Terribilini M, Sander JD, Lee JH, Zaback P, Jernigan RL, Honavar V, Dobbs D: RNABindR: a server for analyzing and predicting RNA-binding sites in proteins. Nucleic Acids Res. 2007, 35: 578-584. 10.1093/nar/gkm294.
- Kim H, Jeong E, Lee SW, Han K: Computational analysis of hydrogen bonds in protein-RNA complexes for interaction patterns. FEBS Letters. 2003, 552: 231-239. 10.1016/S0014-5793(03)00930-X.
- Han K, Nepal C: PRI-Modeler: Extracting RNA structural elements from PDB files of protein-RNA complexes. FEBS Letters. 2007, 581: 1881-1890. 10.1016/j.febslet.2007.03.085.
- Yun MR, Byun Y, Han K: Predicting RNA-binding sites in proteins using the interaction propensity of amino acid triplets. Protein and Peptide Letters. 2010, 17: 1102-1110. 10.2174/092986610791760388.
- Cheng CW, Su EC, Hwang JK, Sung TY, Hsu WL: Predicting RNA-binding sites of proteins using support vector machines and evolutionary information. BMC Bioinformatics. 2008, 9 (Suppl 12): S6. 10.1186/1471-2105-9-S12-S6.
- Kumar M, Gromiha MM, Raghava GP: Prediction of RNA binding sites in a protein using SVM and PSSM profile. Proteins. 2008, 71: 189-194. 10.1002/prot.21677.
- Liu ZP, Wu LY, Wang Y, Zhang XS, Chen L: Prediction of protein-RNA binding sites by a random forest method with combined features. Bioinformatics. 2010, 26: 1616-1622. 10.1093/bioinformatics/btq253.
- Huang Y, Niu B, Gao Y, Fu L, Li W: CD-HIT Suite: a web server for clustering and comparing biological sequences. Bioinformatics. 2010, 26: 680-682. 10.1093/bioinformatics/btq003.
- Shrestha R, Kim J, Han K: Prediction of RNA-binding residues in proteins using the interaction propensities of amino acids and nucleotides. LNCS. 2008, 5226: 114-121.
- Torshin IY, Weber IT, Harrison RW: Geometric criteria of hydrogen bonds in proteins and identification of 'bifurcated' hydrogen bonds. Protein Engineering. 2002, 15: 359-363. 10.1093/protein/15.5.359.
- Baldi P, Brunak S, Chauvin Y, Andersen CA, Nielsen H: Assessing the accuracy of prediction algorithms for classification: an overview. Bioinformatics. 2000, 16: 412-424. 10.1093/bioinformatics/16.5.412.
- Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE: The Protein Data Bank. Nucleic Acids Res. 2000, 28: 235-242. 10.1093/nar/28.1.235.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.