A boosting approach for prediction of protein-RNA binding residues

Background RNA binding proteins play important roles in post-transcriptional RNA processing and transcriptional regulation. Distinguishing the RNA-binding residues in proteins is crucial for understanding how protein and RNA recognize each other and function together as a complex. Results We propose PredRBR, an effectively computational approach to predict RNA-binding residues. PredRBR is built with gradient tree boosting and an optimal feature set selected from a large number of sequence and structure characteristics and two categories of structural neighborhood properties. In cross-validation experiments on the RBP170 data set show that PredRBR achieves an overall accuracy of 0.84, a sensitivity of 0.85, MCC of 0.55 and AUC of 0.92, which are significantly better than that of other widely used machine learning algorithms such as Support Vector Machine, Random Forest, and Adaboost. We further calculate the feature importance of different feature categories and find that structural neighborhood characteristics are critical in the recognization of RNA binding residues. Also, PredRBR yields significantly better prediction accuracy on an independent test set (RBP101) in comparison with other state-of-the-art methods. Conclusions The superior performance over existing RNA-binding residue prediction methods indicates the importance of the gradient tree boosting algorithm combined with the optimal selected features.

are most likely to interact with RNA. Liu et al. [12] implement a RF classifier to detect the RNA binding residues in proteins by integrating interaction propensity with other sequence and structural features. Other RNA-binding site prediction methods include PRINTR [13], RNABindRPlus [14], RBScore [15], NBench [16] and SNBRFinder [17].
Although existing studies [7,[9][10][11][12][13][14][15][16][17][18][19][20][21][22][23][24] have made remarkable progress to explore the interfaces of protein-RNA interactions, there is still great room for improvement. First, precise biological properties for precisely recognizing RNA-binding sites are not fully uncovered; no single feature can effectively identify protein-RNA interaction residues. Second, the number of non-binding sites is much higher than that of RNA-binding residues, which yields the so-called imbalance problem. Also, the imbalanced data tends to cause over-fitting and poor prediction results. Thus, developing effective approaches to address these issues at both data and algorithmic levels, such as feature extraction and selection, re-sampling techniques and one-class learning, is a pressing need.
In this work, we propose a novel RNA-binding residue prediction method named PredRBR, which takes advantage of Friedman's gradient tree boosting (GTB) [25][26][27] and optimal selected features. PredRBR uses the GTB algorithm to iteratively build multiple classification trees based on the 44 optimal features selected from a series of sequence and structural features, especially two categories of structural neighborhood properties. The promising results of cross-validation and independent test demonstrate the effectiveness of PredRBR.

Datasets
We use RBP170 (previously named as RBP199) [13] as the training data set. The proteins in RBP199 were obtained from the protein-RNA complexes in Protein Data Bank (PDB) [28] as of May 2010. PISCES [29] was used to remove proteins with < 30% sequence identity or structures with resolution worse than 3.5Å. Proteins with residues < 40 or RNA-binding residues < 3 or the binding RNA with nucleotides < 5 were further excluded. Since there are 9 complexes (3HUW, 3I1M, 3I1N, 3KIQ, 2IPY, 2J01, 2QBE, 2Z2Q, 3F1E) in PDB obsoleted, a total of 170 protein sequences are generated.
Another independent dataset (BPP101) is collected from PDB with deposition date from June 2010 to May 2014. Similar to RBP170, only non-redundant and highquality RNA-binding proteins are selected (sequence identity < 30% and resolution better than 3.5 Å). We also use CD-HIT [30,31] to remove proteins with sequence similarity >40% to all proteins in RBP170. Finally, 101 protein sequences are obtained from 90 RNA-binding complexes.

Features extraction
A total of 63 sequence and structural site features (SiteFs) are calculated as follows: Physicochemical properties (10 features): The ten physicochemical properties are obtained from the AAindex database [32], including number of atoms, number of electrostatic charge, number of potential hydrogen bonds, molecular mass (Mmass), hydrophobicity, hydrophilicity, polarity, polarizability, propensities and average accessible surface area [33].
Side-chain environment (pKa, 2 features): The sidechain environment pKa scores are extracted from Nelson and Cox [34] representing the side-chain environmental features of a protein.
Evolutionary conservation score (C-score, 1 feature): We use Rata4Site [39] to calculate the C-score for each residue based on the sequence alignments.
Solvent accessible area (ASA, 2 features): ASA properties are computed using DSSP [40], and the maximum solvent accessibility are calculated based on Rost and Sander [41].
Secondary Structure (SS, 3 features): The secondary structure is also calculated using DSSP. The secondary structure can be divided into three categories: helix, sheet and coil. We encode the secondary structure as a 3-d vector. In the results of DSSP, types G, H and I are helix (1, 0, 0); types B and E are sheet (0, 1, 0); types T, S and blank are recognized as coil (0, 0, 1).

Interaction propensity (IP, 4 features):
Interaction propensity is first introduced by Liu [12]. The interaction propensity between the residue triplet t and the nucleotide n is defined as follows: where In the above formulas, f (P,R) (t, n), f P (t) and f R (n) represent the frequency of amino acid triplet t that binds to nucleotide n in the protein-RNA pair (P, R), the frequency of triplet t in protein P and the frequency of nucleotide n in RNA R, respectively. N (P,R) (t, n) is the number of the amino acid triplet t interacting with nucleotide n in protein-RNA pair (P, R); t,n N (P,R) (t, n) is the total number of residue triplets that bind to any nucleotides in the protein-RNA pair (P, R); N P (t) is the number of triplet t in protein P; P N P (t) is the total number of amino acid triplets; N R (n) is the number of nucleotide n in RNA R and R N R (n) is the total number of nucleotides in the dataset. A total of 32,000 IPs are calculated for the 4 nucleotides and 20 3 (8,000) residue triplets. For each residue, four features(IP A , IP U , IP G , IP C ) are used to represent the interaction propensity (IP) of the residue triplet corresponding to different nucleotides (A, U, G and C). The disorder score is pred icted using the method proposed by Obradovic et al. [42,43].
Atom contacts and residue contacts (2 features): We calculate the atom contacts (NC a ) of an amino acid by aggregating all-atom contacts (C a ) between the amino acid and any other residue in the protein, then dividing the number of atoms in the amino acid, as described in our previous work [44,45]. Similarly, we compute the residue contacts (NC r ) by summing all the contacts of the amino acid and then dividing the number of atoms in the amino acid.
Pair potentials (PP, 1 feature): Contact potential (CP) between residue i and j is defined as follows: where P i,j is the contact potential of pair (i, j) collected from the work of Keskin et al. [46]; d i,j is the distance between residue i and j. Note that the neighbors of a target residue are defined as a sphere of a certain radius of 7.0Å [47] based on the side chain center of mass. The overall contact potential of residue i (PP i ) is calculated as follows: Topographical index (1 feature): The topographical score describes the structural environment of a amino acid. We compute the rate between structurally neighbor amino acids and the average number of residues for a specific amino acid type [44,45,48].
Local structural entropy (LSE, 2 features): The local structural entropy [49] of a residue is calculated based on the protein sequence. The potential of a amino acid within a secondary structure (β-bridges, extended β-sheets, 3 10 -helices, α-helices, π-helices, bends, turns and other types) is estimated. More secondary structures the residue appeared in, the higher LSE score will be assigned. We compute the LSE score of a specific residue by averaging four successive sequence windows along the protein sequence. We also define a new attribute named LSE to measure the difference of LSE value between the wild-type protein and its mutants.
Four-body statistical pseudo-potential (FBS2P, 1 feature): The FBS2P score is based on the Delaunay tessellation of proteins [50], which can be calculated as a log-likelihood ratio: where i, j, m and n are identities of the four amino acids (20 possibilities) in a Delaunay tetrahedron of the protein.
Each point represents a residue. f α ijmn is the observed frequency of the residue composition (ijmn) in a tetrahedron of type α over a set of protein structures, while p α ijmn is the expected random frequency.
Side chain energy score (SCE-score, 6 features): The SCE-score is a linear combination of multiple energetic terms, including surface area of atom binding, overlap volume, hydrogen bonding energy, electrostatic interaction energy, buried hydrophobic SAS area and buried SAS area between the target residue and the rest of the protein, respectively [50].
Voronoi contacts (2 features): The Voronoi contact is calculated based on the Voronoi neighbors in protein structure, as described in Ref. [51].

Structural Neighborhood Features (SNF-EDs & SNF-VDs):
In this work, two types of structural neighborhood features (Euclidean and Voronoi) are used. This two structural neighborhood groups named as SNF-EDs and SNF-VDs are defined based on Euclidean distance and Voronoi division [44] respectively. The SNF-EDs is a set of residues located within a sphere of 10Å in Euclidean distances from the central residue. The feature i for a neighbor n (the n-th residue) with regard to the target residue r (the r-th residue) is defined as follows: and d r,n ≤ 10Å, 0 otherwise, (8) where d r,n is the minimum Euclidean distance between any heavy atoms of residue r and that of residue n. The SNF-EDs of target residue r is defined as: where m is the total number of Euclidean neighbors. We also use Voronoi division to define neighbor residues. For each protein 3D structure, the 3D space is partitioned into Voronoi polyhedra around individual atoms. A pair of residues are defined to be Voronoi neighbors when there exits a Voronoi facet in common for the two residues. The Qhull package [52] is used to compute Voronoi division.
Give the target residue r and its neighbors n {n = 1, ..., m}, for each site feature i, a Voronoi neighborhood property is defined as: where P i (n) is the value of the residue feature i for neighbor n.
Finally, a large number of 63 × 3 = 189 site, Euclidean and Voronoi characteristics [53] are obtained for RNAbinding site prediction.

Gradient tree boosting algorithm
The Gradient Tree Boosting (GTB) [25][26][27] is an effective ensemble method for regression and classification issues. Here we apply GTB to predict RNA binding residues. For the input feature vectors N, where "-1" denotes nonbinding resides and "+1" represents RNA-binding sites. The details of the GTB algorithm is shown in Algorithm 1.

Input:
Data set: Compute the negative gradient as the working response Fit a classification model to r i by Logistic function using the input χ i and get the estimate α m of βh(χ; α) 5: In this algorithm, the number of iterations is initialized as M; L(y, (x)) is the log loss function; y represents the label and (χ) is a decision function; N is the number of residues in RBP170. The GTB algorithm iteratively repeats steps 2-7 to build m different classification trees h(χ, α 1 ), h(χ, α 2 ), ..., h(χ, α m ) from a set of training data. β m is the weight and α m is the parameter vector of the mth tree h(χ, α m ). At the end, we can obtain the function M (χ) and build a GTB model˜ (χ). Note that the GTB algorithm is implemented using scikit-learn [54].

The PredRBR framework
The flow chart of PredRBR is shown in Fig. 3. A wide range of sequence and structural site features (63 SiteFs), and two groups of neighborhood attributes (63 SNF-EDs and 63 SNF-VDs) are computed. We use the Maximum Relevance Minimum Redundancy and Incremental Feature Selection (mRMR-IFS) [55] approach to select a small subset of optimal features that make the greatest contribution to the classification. maximum Relevance Minimum Redundancy (mRMR) mRMR means that a feature may be selected preferentially has the maximal correlation with the target attribute and minimal redundancy with the characteristics already chosen. mRMR is measured with mutual information (MI), and the definition is as follows: where x and y are two random attributes; p(x, y) is the joint probabilistic density; p(x) and p(y) are the marginal probabilistic densities. The detailed description of mRMR can be found in Ref. [55]. An ordered list of features are obtained by applying mRMR to the benchmark RBP170 with 189 features.

Incremental Feature Selection (IFS)
Based on the ordered feature list generated by mRMR, we use IFS to decide the optimal feature set. A total number of n feature sets are generated based on the mRMR results as follows: where f i is the i − th sorted feature; F i is the i − th feature set; n is the number of features. We use the GTB algorithm to build classifiers based on each feature subset F i and evaluate the performance with 10-fold crossvalidation. We select the feature subset with the highest overall performance (AUC+MCC) as the optimal feature set.
Deal with the imbalance problem In the benchmark RBP170, the amount of non-binding sites is about 6 times that of RNA binding sites. To deal with the imbalance problem, we use a random under-sampling strategy to generate the new balanced datasets. In the training set, negative samples (non-binding sites) are randomly selected and combined with the positive samples create a 1:1 balance dataset.

Evaluation measures
To evaluate the performance of PredRBR, some widely used measurements are also adopted, including sensitivity (SN/Recall), specificity (SP), precision (Pre), accuracy (ACC), F-measure and Matthews Correlation Coefficient (MCC) score. These metrics are defined as follows: In these equations, the TP, TN, FP, FN refer to the numbers of true positive, true negative, false positive and false negative residues in the prediction, correspondingly. In addition, the ROC graph is formed by plotting the false positive rate (i.e. 1 -specificity) against the true positive rate, which equals sensitivity. Furthermore, the area under the receiver operating characteristic (ROC) [56] curve (AUC) is also utilized for evaluating prediction performance.

Results and discussion
In this section, we first tested the prediction performance of the PredRBR model with different combinations of features, including PSSMs, site features (SiteFs) and structural neighborhood features (SNF-EDs & SNF-VDs), and compared the performance of SiteFs and structural neighborhood features. Then, the mRMR-IFS method is used to select the optimal feature set from all obtained properties. We also implemented many machine learning algorithms using the selected features and compared the prediction performance of gradient tree boosting classifier with these methods using 10-fold cross-validation. Finally, we compared the PredRBR model with existed previous approaches on the same independent test set, and an example of the predicted interface residues with RNA in the protein 3R2C:A is provided to illustrate the proposed method.

Evaluation of different feature combinations
In previous approaches, many combinations of features have been widely applied to get improved predictions of protein-RNA interaction residues, including physicochemical features, side-chain environment, sequence conservation score, position-specific scoring matrices (PSSMs), relative accessible surface area (RASA), secondary structure (SS), interaction propensity and so on. Based on these researches [7,[9][10][11][14][15][16][17][18][19][20][21][22][23][24], we combined a variety of features of the amino acids to represent the specific interaction attributes of protein residues with RNA nucleotides. In this work, some of the site characteristics, such as relative accessible surface area, secondary structure and interaction propensity, can be calculated only after the protein structure information is available. Thus, we categorize these site features into structurebased characteristics, and others are sequence features. To investigate the performances of different features combinations, including the mRMR-IFS selected features, we build a series of sub-models based on the those features and compared the prediction performances of these model using 10-fold cross-validation on the RBP170 dataset. The detailed results are depicted in Table 1. The performance of each model is measured by seven metrics: accuracy (ACC), sensitivity (SN), specificity (SP), Precision, F-measure, MCC and area under curve (AUC). Note that the site features (SiteFs) is the 63D basic sequence and structure properties, including none of structural neighborhood features, and the PSSM column in Table 1 is a subset of the site features. As shown in Table 1, the performance of prediction based on PSSM is not so good, at least not reach our research aims. In contrast, the method with site features (SiteFs) achieves a relatively good performance with a AUC value of 0.84, there is at least 5% increase in overall accuracy, sensitivity, specificity, MCC, F-measure and AUC score compared with PSSM. The Euclidean neighborhood features (SNF-EDs) outperforms PSSM and SiteFs, with at least a 3% improvement on AUC score, which suggests that SNF-EDs is an important feature type for predicting protein-RNA binding residues. When combining all of the structural neighborhood features (SNF-EDs+SNF-VDs), the improvement on performance is impressive, at least 4% increase in ACC and 5% increase in AUC score compared with site features (SiteFs). The optimal 177 features (Top177) are selected from the full combined features (SiteFs+SNF-EDs+SNF-VDs ) with an   [55]) and achieve the best performance.

Contribution of feature selection
Selecting the most informative features is essential for the prediction performance enhancement, and may consequently improve our understanding of the molecular mechanism of RNA-binding sites. A total of 189 site, Euclidean and Voronoi features are initially calculated. We use mRMR-IFS [55], a filter-based approach to rank the features and select the top k attributes. The classifier with the top 177 features achieves the highest performance (MCC =0 .55 and AUC = 0.92) in cross-validation on RBP170 (Fig. 4). We select the 177 optimal features to build the final RNA-binding site prediction model. As shown in Table 1, the performance of the top 177 features selected using mRMR-IFS is significantly better than that of other feature combinations.
We also analyze the numbers of sits (SiteFs), Euclidean (SNF-EDs) and Voronoi (SNF-VDs) features that occurred in the top N characteristics sorted by using the mRMR method, respectively. Figure 5 shows the numbers of the three categories of features exited in the top N (range from 10 to 100) selected properties. We observed that structural neighborhood characteristics (SNF-EDs and SNF-VDs) [44] occupy the majority of the top N list, implying that structural neighborhood characteristics paly a critical role in boosting the performance of RNA-binding residue prediction.

Performance comparison with other machine learning methods
We further compare the effectiveness of PredRBR with existing state-of-the-art machine learning methods, including Support Vector Machine (SVM) [57], Random Forest (RF) [58] and Adaboost [59]. Table 2 shows the prediction results of these classifiers. It is worth indicating that all examined methods employ the same feature set on the training dataset (RBP170) with 10-fold crossvalidation. With a specificity of 0.84, PredRBR obtains a sensitivity of 0.85, a precision of 0.47, a F-measure of 0.60 and a MCC value of 0.55. The best one among these compared machine learning methods is Random Forest with its sensitivity of 0.81 and specificity of 0.83 as well as F-measure of 0.57. Comparing with Random Forest, PredRBR obtains at least 2% increase in sensitivity, 7% increase in MCC value and 5% increase in F-measure. PredRBR also achieves higher AUC score than that of other comparison machine learning approaches. The AUC score of PredRBR is 0.92, while those of the three machine learning methods are in the range of 0.87∼0.90. The results imply that our proposed GTBbased PredRBR model plays crucial role in performance boosting.

Results of the independent evaluation
We validate the usability of the proposed PredRBR model on the independent test dataset. The independent test dataset (RBP101) has 101 non-homologous proteins Due to the imbalance between positive sample and negative sample, the receiver operating characteristic (ROC) curve is regarded as proper measurement to evaluate the overall performance. Higher curve of ROC represents better prediction accuracy. Figure 6 shows the ROC curves and AUC scores of PredRBR and other machine learning methods on the RBP101 dataset. PredRBR, SVM, Adaboost and Random Forest achieve AUC values of 0.82, 0.80, 0.78 and 0.76, respectively. Comparing with the other methods, the PredRBR model improves the AUC score by 2%∼6%. We compare PredRBR with several existing state-of-theart RNA-binding residue prediction approaches, including BindN [9], PPRint [20], Liu-2010 [12], BindN+ [22], RNABindR2.0 [23], RNABindRPlus [14] and SNBRFinder [17] on the independent set (RBR101). In these methods, BindN [9], BindN+ [9] and PPRint [20] use SVM to build the RNA-binding site classifier; RNABindRPlus [14] utilizes a logistic regression method to integrate the homology-based method HomPRIP and optimized SVM model named SVMOpt; Liu-2010[12] is RF-based method with sequence and structural features especially the proposed interaction propensity, and SNBRFinder [17] is a hybrid method based on the sequence features.
As shown in Table 3, PredRBR achieves the best predictive performance with an accuracy of 0.83, a sensitivity of 0.59, specificity of 0.85, precision of 0.28, F-measure of 0.38 and MCC of 0.32. The results indicate that 59% of the real RNA-binding residues are correctly identified (sensitivity), and 85% of the non-RNA binding residues are precisely predicted (specificity). In the control methods, SNBRFinder gains the best prediction results (sensitivity=0.65, specificity=0.80, F-measure=0. 36 and MCC=0.31). The performance our PredRBR method goes beyond SNBRFinder regarding F-measure and MCC. Particularly, the specificity of PredRBR is significantly better than that of RNABindR (increased by 5%), which suggests that PredRBR would be able to determine the residues that do not exist in the RNA-binding surface better and reduce the experiment cost. The ROC curves of PredRBR and other existing methods are shown in Fig. 7, which are drawn by varying the cutoffs of the prediction scores to calculate the sensitivities and specificities of these methods. The AUC scores (areas under ROC curves) of the eight methods, including PredRBR, SNBRFinder, RNABindRPlus , RNABindR 2.0, BindN+, PPRint, Liu-2010, BindN, are about 0.82, 0.80, 0.73, 0.72, 0.72, 0.68, 0.66 and 0.64, respectively. These improvements on the prediction indicate that our proposed Pre-dRBR method integrating the GTB algorithm and the optimal selected 177 features particularly the structural neighborhood properties can effctively predict RNAbinding residues.

Case study
The ternary NusB-NusE-BoxA RNA complex (PDB code 3R2C) initiates the complete antitermination complex required by the processive transcription antitermination. The complex NusB-NusE-BoxA reveals the significance of  Figure 8 shows the comparison between actual interaction residues and predicted RNA binding residues in the protein 3R2C:A. Figure 8a presents the actual interaction residues of protein 3R2C:A and the red spheres represent real RNA binding residues. Figure 8b shows the binding sites predicted by PredRBR. The results show that most of the actual interaction residues are well identified by the PredRBR model.

Conclusion
In this study, we have developed PredRBR, a highperformance protein-RNA binding site prediction method. The novelty of the proposed method lies in the idea that we widely integrate a large number of sequence, structural and energetic characteristics, together with two categories of Euclidian and Voronoi neighborhood features, produces more critical clues for RNA-binding residue prediction. A total of 63 site-based, Fig. 8 Comparison between experimentally determined RNA binding sites (a) and predicted RNA binding residues (b). a Actual RNA-binding residues in protein 3R2C:A. Result of (b) is predicted binding residues by PredRBR and the numbers of predicted TP, FP, TN and FN in 3R2C:A are 30, 8, 92, and 8, respectively. The true positive, true negative, false positive and false negative residues are shown in red, yellow, black and blue, respectively 63 Euclidian and 63 Voronoi neighborhood features have been obtained. We use the mRMR-IFS approach to select an optimal subset of 177 features to reduce the computational time and improve the performance. Our results also highlight the benefits of basing RNA-binding residue prediction method on the GTB algorithm and structural neighborhood characteristics (Euclidian and Voronoi). Both cross-validation and independent test show that PredRBR performs significantly better than other existing state-of-the-art methods such as Liu-2010, BindN+, RNABindRPlus, BindN, PPRint, SNBRFinder and RNABindR2.0. Furthermore, we demonstrate the effectiveness of our approach to an RNA binding complex and obtained encouraging results. A limitation of PredRBR is that it is a structurebased approach, which use an encoding of sequence and structure-derived features of a target residue and its structural neighborhood features to make predictions. RNAbinding sites of proteins without known 3D structures can't be well predicted. However, the number of proteins with known structures has increased rapidly in the past few years especially due to the accurate theoretical models that can be produced when using the solved representatives as templates for the models.
In the future, we will try to extract more effective features and machine learning methods to further improve the RNA-binding residue prediction. Also, we will develop an open access web-server for the proposed PredRBR method.