iPNHOT: a knowledge-based approach for identifying protein-nucleic acid interaction hot spots

Background: The interaction between proteins and nucleic acids plays pivotal roles in various biological processes such as transcription, translation, and gene regulation. Hot spots are a small set of residues that contribute most to the binding affinity of a protein-nucleic acid interaction. Compared with the extensive studies of hot spots on protein-protein interfaces, hot spot residues within protein-nucleic acid interfaces remain less well studied, in part because mutagenesis data for protein-nucleic acid interactions are not as abundant as those for protein-protein interactions.

Results: In this study, we built a new computational model, iPNHOT, to predict hot spot residues on protein-nucleic acid interfaces. A training data set and an independent test set were collected from dbAMEPNI and some recent literature, respectively. To build our model, we generated 97 different sequential and structural features and used a two-step strategy to select the relevant features. The final model was built on only 7 features using a support vector machine (SVM). These include two features, ∆SASsa^(1/2) and esp3, that are newly proposed in this study. In cross validation on the subset collected from ProNIT, our model achieved an F1 score of 0.725 and an AUROC of 0.807, compared with 0.407 and 0.670 for mCSM-NA, a state-of-the-art model for predicting the thermodynamic effects of protein-nucleic acid interactions. The iPNHOT model was further tested on the independent test set, where it outperformed other methods.

Conclusion: In this study, by collecting data from the recently published database dbAMEPNI, we proposed a new model, iPNHOT, to predict hot spots on both protein-DNA and protein-RNA interfaces. The results show that our model outperforms the existing state-of-the-art models. Our model is available to users through a webserver: http://zhulab.ahu.edu.cn/iPNHOT/.

No.   Name     Description
-     PItb     Average protrusion index of the total residue in bound state
18    -        Average protrusion index of the side chain of the residue in bound state
...
-     esp1     Electrostatic potential of the residue
82    esp2     Electrostatic potential of the neighbor residues of the target residue
83    esp3     Electrostatic potential of the neighbor residues and the target residue
84    esp4     Average of esp2
85    esp5     Average of esp3
86    HB1      Number of hydrogen bonds formed by the residue and the nucleic acids
87    HB2      Number of hydrogen bonds between the side chain of the residue and the nucleic acids
88    Helix    Whether the secondary structure of the residue is an alpha helix (H)
89    Sheet    Whether the secondary structure of the residue is a beta-sheet (E)
90    Turn     Whether the secondary structure of the residue is a turn (B, T, S)
91    Helix1   Whether the secondary structure of the residue is another helix type (G, I)
92    Loop     Whether the secondary structure of the residue is a loop
93    CNSV     Conservation score
94    -        Relative conservation of the actual residue compared to alanine at a given position, based on the weighted observed percentages

a The explanation of the 10 properties can be found in Table S5.

Residues' physicochemical characteristics
As shown in List 1 of the main text, three of the ten physicochemical features were selected by decision trees: Na (number of atoms), Nphb (number of potential hydrogen bonds) and Hdrpo (hydrophobicity). We calculated the average values of these three features for hot spots and non-hot spots. As shown in Fig. S1, the average values of two of the features (Na and Hdrpo) are larger for hot spot residues than for non-hot spot residues. The differences in these three features between hot spots and non-hot spots were analyzed using the Wilcoxon Rank Sum test. As shown in Fig. S1, the difference in Na between hot spots and non-hot spots is statistically significant, with a P-value of 0.0086. Among these three features, Nphb was selected in our final model by the sequential forward feature selection (SFS) process.
Although the difference in Nphb between hot spots and non-hot spots is not statistically significant, this feature may be complementary to the other selected features.
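The comparison described above can be sketched in a few lines. The following is a minimal, standard-library version of a two-sided Wilcoxon rank-sum test using the normal approximation with midranks for ties and no tie correction; it is an illustration, not the code used to produce Fig. S1. The function name and the sample values are hypothetical.

```python
import math

def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test (normal approximation, no tie correction).

    Returns (z, p), where z < 0 means the values in x tend to be smaller than in y.
    """
    # Pool both samples, remembering which group (0 = x, 1 = y) each value came from.
    pooled = sorted((v, g) for g, sample in enumerate((x, y)) for v in sample)
    n = len(pooled)
    rank_sum_x = 0.0
    i = 0
    while i < n:
        # Find the run of tied values starting at position i.
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2.0  # average of the 1-based ranks i+1 .. j
        rank_sum_x += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    n1, n2 = len(x), len(y)
    expected = n1 * (n1 + n2 + 1) / 2.0
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (rank_sum_x - expected) / sd
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided tail probability
    return z, p

# Hypothetical feature values, for illustration only (not data from this study).
hot_spot_values = [11, 10, 11, 9, 7]
non_hot_spot_values = [5, 6, 7, 4, 8]
z, p = rank_sum_test(hot_spot_values, non_hot_spot_values)
print(f"z = {z:.3f}, P-value = {p:.4f}")
```

For small samples an exact test is preferable to the normal approximation; a statistics package would normally be used in practice.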

Solvent accessible surface area
As shown in List 1 of the main text, 6 of the 54 SASA-related features were selected by decision trees: SAStau, SASbau, SASpau, ∆SASta^(1/2), ∆SASsa^(1/2), and ∆SASnr^(1/2). As shown in Fig. S3, the average values of five of these features (SAStau, SASbau, ∆SASta^(1/2), ∆SASsa^(1/2), and ∆SASnr^(1/2)) are larger for hot spot residues than for non-hot spot residues. Using the Wilcoxon Rank Sum test, we analyzed the statistical significance of the differences in the six features between hot spot and non-hot spot residues. Fig. S3 shows the resulting P-values and indicates which of these differences are statistically significant.
After the SFS process, two of them were selected in the final model: SAStau and ∆SASsa^(1/2). Only the difference in ∆SASsa^(1/2) is statistically significant between hot spot and non-hot spot residues, which again implies complementarity between different features. Note that ∆SASsa^(1/2) is a unique feature introduced only in this work and in our previous work on predicting hot spots on protein-protein interfaces [1]. Its selection suggests that the square root of the buried side-chain solvent accessible surface area may be more closely related to the binding affinity than the buried side-chain solvent accessible surface area itself.
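As a minimal sketch of the square-root feature (∆SASsa with exponent 1/2), the code below assumes that the buried area is the side-chain SASA in the unbound state minus that in the bound state. That definition, the clamping of small negative differences to zero, and the function names are assumptions made for illustration; the source only states that the feature is the square root of the buried side-chain solvent accessible surface area.

```python
import math

def buried_sasa(sasa_unbound, sasa_bound):
    """Buried area (in square angstroms), assumed here to be unbound minus bound SASA.

    Small negative differences caused by numerical noise are clamped to zero
    (an assumption of this sketch).
    """
    return max(sasa_unbound - sasa_bound, 0.0)

def sqrt_buried_sasa(sasa_unbound, sasa_bound):
    """Square root of the buried area, i.e. the exponent-1/2 variant of the feature."""
    return math.sqrt(buried_sasa(sasa_unbound, sasa_bound))

# Hypothetical side-chain SASA values, for illustration only.
print(sqrt_buried_sasa(100.0, 36.0))
```

The square root compresses large buried areas, so a residue burying four times as much surface contributes only twice the feature value.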

Electrostatic potential
As shown in List 1 of the main text, 2 of the 5 electrostatic-potential-related features were selected by decision trees: esp1 and esp3. As shown in Fig. S4, the average values of esp1 and esp3 are both larger for hot spot residues than for non-hot spot residues.
Using the Wilcoxon Rank Sum test, we analyzed the statistical significance of the differences in the two features between hot spot and non-hot spot residues. Fig. S4 shows the resulting P-values. After the SFS process, esp3 was selected in the final model. esp3 is the electrostatic potential of the neighbor residues and the target residue, and esp1 is the electrostatic potential of the target residue alone. The P-value of esp3 is smaller than that of esp1, which indicates that the electrostatic potential of the residue patch differentiates hot spot from non-hot spot residues better than that of the single target residue. These electrostatic-potential-related features were proposed for predicting protein-DNA binding sites in our previous works [2,3], and the results of this work show that they are also effective for predicting hot spots within protein-NA interfaces.
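The relationship between the single-residue and patch-level features can be illustrated with a short sketch. It assumes that the patch potentials are plain sums of per-residue potentials and that the two averaged variants in the feature table (called esp4 and esp5 here) divide by the number of residues involved. The source does not state how the per-residue values are combined, so the aggregation rule, the function name, and the input values are all hypothetical.

```python
def patch_esp_features(target_esp, neighbor_esps):
    """Derive five patch-level features from per-residue electrostatic potentials.

    Assumption of this sketch: patch values are sums of per-residue values.
    """
    n_neighbors = len(neighbor_esps)
    esp1 = target_esp                      # target residue only
    esp2 = sum(neighbor_esps)              # neighbor residues only
    esp3 = esp1 + esp2                     # neighbors plus target (the residue patch)
    esp4 = esp2 / n_neighbors if n_neighbors else 0.0  # average over neighbors
    esp5 = esp3 / (n_neighbors + 1)        # average over the whole patch
    return {"esp1": esp1, "esp2": esp2, "esp3": esp3, "esp4": esp4, "esp5": esp5}

# Hypothetical potentials, for illustration only.
print(patch_esp_features(2.0, [1.0, 3.0]))
```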

Secondary structure
As shown in List 1 of the main text, only 1 of the 5 secondary-structure-related features was selected by decision trees: Helix. As shown in Fig. S5, the average value of Helix for hot spot residues is 0.279, which is smaller than the value for non-hot spot residues; however, the difference is not statistically significant (P-value of 0.408).
After the SFS process, Helix was kept in the final model, which indicates that it may complement the other selected features.
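The five secondary-structure features are binary indicators, which can be sketched as a one-hot encoding. The grouping of one-letter codes below follows the feature table (H; E; B, T, S; G, I; everything else as loop). Treating the input as DSSP-style codes and the function name are assumptions of this sketch.

```python
def encode_secondary_structure(code):
    """Map a one-letter secondary-structure code to five 0/1 indicator features."""
    groups = {
        "Helix": {"H"},            # alpha helix
        "Sheet": {"E"},            # beta-sheet
        "Turn": {"B", "T", "S"},   # turn
        "Helix1": {"G", "I"},      # other helix types
    }
    features = {name: int(code in codes) for name, codes in groups.items()}
    # Any code not covered above is treated as a loop.
    features["Loop"] = int(not any(features.values()))
    return features

# Hypothetical residues, for illustration only.
codes = ["H", "E", "T", "-", "H"]
helix_values = [encode_secondary_structure(c)["Helix"] for c in codes]
print(sum(helix_values) / len(helix_values))  # fraction of residues in alpha helices
```

Under this encoding, the average value of Helix over a group of residues is simply the fraction of those residues located in alpha helices.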

Conservation
In this study, we proposed four new types of conservation-related features in addition to the traditional conservation score based on information entropy. However, only the traditional conservation score, named CNSV, was selected by the decision tree. As shown in Fig. S6, the average CNSV of hot spot residues is 1.26, slightly smaller than that of non-hot spot residues (1.30); however, the difference is not statistically significant (P-value of 0.798).
After the SFS process, CNSV was not selected in the final model.
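The sequential forward selection step referred to throughout this section can be sketched generically. The version below greedily adds the feature that most improves a user-supplied score and stops when no candidate improves it. It is a sketch of the general technique, not the exact procedure used to build the model. In practice the score would be something like the cross-validated F1 score of an SVM trained on the candidate subset; the toy scoring function and feature names here are hypothetical and only illustrate how a weak feature can be kept because it complements another one.

```python
def sequential_forward_selection(candidates, score, max_features=None):
    """Greedy forward selection: repeatedly add the feature that raises the score most."""
    selected = []
    best_score = float("-inf")
    remaining = list(candidates)
    while remaining and (max_features is None or len(selected) < max_features):
        # Score every one-feature extension of the current subset.
        trials = [(score(selected + [f]), f) for f in remaining]
        top_score, top_feature = max(trials, key=lambda t: t[0])
        if top_score <= best_score:
            break  # no candidate improves the score
        selected.append(top_feature)
        remaining.remove(top_feature)
        best_score = top_score
    return selected, best_score

def toy_score(subset):
    """Hypothetical score: B is weak alone but adds value once A is present."""
    s = set(subset)
    value = 0.0
    if "A" in s:
        value += 0.5
    if "B" in s:
        value += 0.05
    if {"A", "B"} <= s:
        value += 0.2
    if "C" in s:
        value -= 0.1
    return value

chosen, final_score = sequential_forward_selection(["A", "B", "C"], toy_score)
print(chosen, final_score)
```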