Improve hot region prediction by analyzing different machine learning algorithms

Background In the process of designing drugs and proteins, it is crucial to recognize hot regions in protein–protein interactions. Each hot region of protein–protein interaction is composed of at least three hot spots, which play an important role in binding. However, it takes time and labor force to identify hot spots through biological experiments. If predictive models based on machine learning methods can be trained, the drug design process can be effectively accelerated. Results The results show that different machine learning algorithms perform similarly, as evaluating using the F-measure. The main differences between these methods are recall and precision. Since the key attribute of hot regions is that they are packed tightly, we used the cluster algorithm to predict hot regions. By combining Gaussian Naïve Bayes and DBSCAN, the F-measure of hot region prediction can reach 0.809. Conclusions In this paper, different machine learning models such as Gaussian Naïve Bayes, SVM, Xgboost, Random Forest, and Artificial Neural Network are used to predict hot spots. The experiment results show that the combination of hot spot classification algorithm with higher recall rate and clustering algorithm with higher precision can effectively improve the accuracy of hot region prediction. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-021-04420-0.

interaction surface [4]. That is, hot residues often appear in the form of interaction clusters on the interaction interface, and the residues in these clusters interact with each other to form a stable network structure, which is called the hot region. In drug design, the study of hot region plays a positive role in the prediction of protein functional sites, drug target and protein design [5,6].
The study showed that the hot region in protein-protein interaction is composed of at least three hot spot residues, which on the protein interaction interface. Starting from the machine learning method, Xia [7] based on protein sequence, structure and neighborhood features to extract feature, combined with maximum relevance and minimum redundancy algorithm, and create a hot spot prediction model based on Support Vector Machine (SVM). Then, Tuncbag [8] proposes an empirical model that combines accessible surface area and pairing propensity to predict hot spots residues, which improves the accuracy of hot spot residue prediction. To achieve better property based on structural features, Huang [9] designed an assembly learning method that combines SMOTE with data imbalance to predict hot spot residues. Hu [10] constructed a new learning hot spot prediction model based on protein sequence feature.
A better hot spot residue prediction model is beneficial to the prediction of hot region. Cukuroglu [11] analyzed hot region according to the characteristics of hot spot residues and the formation rules of hot region, and established a hot region database named Hot Region. Pons uses the small-world residue network to predict hot regions, using the small-world network method, and through the relationship between the residues, the residues can form an interconnected network [12]. Nan [13] used complex network and community detection methods to predict hot region in protein interactions. In the process of predicting hot regions, some False Positives (FP) and False Negatives (FN) in the prediction results are corrected by using the topological characteristics of residues in the network, so as to improve the accuracy of predicting hot regions. An approach based on a new clustering algorithm called Local Community Structure Detecting (LCSD) to identify the hot regions was proposed by Lin [14], with an enhanced maximum relevance minimum redundancy algorithm to upgraded prediction performance in the feature selection process of hot spot prediction.
The prediction of hot spot residues in protein interaction is the first step for predicting hot region. It is necessary to identify the hot spot residue as accurately as possible on the protein-protein interaction. Due to the limitation of amino acid mutation to alanine in the data set and the imbalance between hot spots and non-hot spots in the data set, the prediction effect of hot spots and hot regions in protein-protein interaction is not significant. With the release of the SKEMPI2.0 dataset, there were twice as many mutations to alanine in the SKEMPI2.0 dataset as there were in the previous version of SKEMPI1.0 [15,16]. From multiple perspectives, we extracted features according to protein sequence, structure and the relationship between amino acid and built several machines learning models to predict hot spots residues. The hot spot residue prediction results of different machine learning algorithms were analyzed, and DBSCAN clustering algorithm was combined to form hot spots [17]. The experiment results show that the combination of hot spot classification algorithm with higher recall rate and clustering algorithm with higher precision can effectively improve the accuracy of hot region prediction.

Dataset
The datasets used in this article are from up-to-date SKEMPI 2.0 databases (Structural database of Kinetics and Energetics of Mutant Protein Interactions). The dataset encompasses the variation data of thermodynamic parameters and kinetic rate constant parameters before and after the mutation of amino acids to alanine, leucine and other different types of amino acids. The data in the SKEMPI2.0 dataset are all from experiments or authoritative published literature. Due to different experimental environments, mutations at the same site may have multiple different values of binding free energy in the database, so we use the average value of binding free energy to replace the repeated data and eliminate the empty data. After that, 180 protein complexes were obtained from the SKEMPI2.0 database, and the corresponding structural information of each complex was obtained from the PDB database (Protein Data Bank). Each protein complex consists of a stack of interface residues whose accessible surface area is reduced by more than 1 Å during the formation of the protein complex. We defined hot and non-hot residues according to the energy changes in the alanine mutation experiment of these residues.
In SKEMPI2.0 data set, the average value of ∆∆G are used as the final result for the binding free energy of the same site under different experiments. At present, most of the research on hot spot residues adopts such a definition standard: In the alanine experiment, the interface residues with binding free energy change greater than 2 kcal /mol were regarded as hot spot residues, and the interface residues with binding free energy change less than 0.4 kcal /mol were regarded as non-hot spot residues, and the data that the binding free energy varies between 0.4 kcal/mol and 2.0 kcal/mol are discarded. We found that using 2.0 kcal/mol as the hot spot definition standard of the SKEMPI2.0 data set would cause the data to be extremely unbalanced, resulting in a sharp drop in the recall rate of the hot spot prediction model. The strategy of using 1.0 kcal/mol as the threshold can make better use of the entire data set. Therefore, we define more than 1.0 kcal/mol as hot spot residues, and less than 1.0 kcal/mol as nonhot spot residues. Using 1.0 kcal/mol as the standard for defining hot and non-hot spots, we finally obtained 2326 interface residues from 180 protein complexes about SKEMPI2.0 database, including 1513 non-hot and 813 hot spot residues. Table 1 shows the specific distribution of 20 amino acids in the data set, which indicates that amino acids with aromatic side chains are more likely to have hot spots residues and TYR, ARG, LYS, and GLU are easier to exist in the hot spot residues. Otherwise, TYR, SER, ARG, and GLU are more likely to appear in the interface residues.
More than two-fifths of the data in the SKEMPI 2.0 database come from the previous version SKEMPI 1.0. Since a lot of research has been done in SKEMPI 1.0, the main work in this article focuses on the dataset extended by SKEMPI 2.0. In order to enhance the stability of the prediction model, in the SKEMPI 2.0 expansion data, we added the protein complex containing the number of interface residues less than 3 into the training set, and put the remaining data of expansion data the test set. The rest of data in SKEMPI 2.0 is regarded as the testing set. Table 2 shows the detailed data of training set and test set.

Experimental results
In the hot spot prediction stage, we need to predict the hot spot residues as precisely as possible. The optimal feature subset was obtained through feature selection by mRMR (Max-Relevance and Min-Redundancy) algorithm. The calculated score and details for each feature are in Additional file 1. In this section, we used several traditional machine learning methods to build the hot spot prediction model. such as RF (Random Forest), Xgboost (eXtreme Gradient Boosting Decision Tree), ANN (Artificial Neural Network) and SVM (Support Vector Machine) [18][19][20][21], and compared these widely used methods in hot spot research with Gaussian Bayes. For the ANN algorithm, we build a five-layer neural network, in which the activation function of each layer is ReLU (rectified linear units) [22]. Because neural networks have automatic feature fusion functions, in addition to ANN, other machine learning methods all use mRMR feature selection algorithm for feature selection in the feature selection process. Table 3 shows the prediction results of different machine learning algorithms in the hot spot prediction experiment. According to Table 3, it can be concluded that the accuracy and precision of the Xgboost method is the best compared to other methods. The performance of GNB method on recall and F-measure is the highest. The experimental  data showed that GNB could correctly predict more hot spot residues, and Xgboos could correctly predict more non-hot spot residues. Due to hot spot residues are fewer than non-hot spot residues on the training set, the accuracy of GNB is lower than that of Xgboost. The more hot spots residues are predicted, the more hot regions we can get during clustering process. After the hot spot residue prediction, DBSCAN clustering algorithm was used to predict the hot region. The clustering results are presented in Table 4. The F-measure represents the balance between recall and precision, using F-measure as the evaluation criterion, the two parameters "Min" and "ε" in the DBSCAN algorithm are determined by the grid search method. Experiment results show that the hot region prediction model combined with GNB and DBSCAN algorithms is significantly better than other methods. The Fig. 1 showed that the number of hot spot residues correctly predicted by GNB in a single hot region of 47 standard hot regions is close to the performance of other algorithms. The more true positive hot spot residues are predicted, the more hot region will be correctly constructed. Otherwise, few hot spots will cause the recall rate of hot region prediction to decrease. Besides, because of the lack of true positive residues, some hot regions are incorrectly clustered in the process of forming true positive hot regions with relatively unconstrained parameters.

Standard hot regions and predicted hot regions
In this paper, we define the standard hot region according to Keskin [23]. The definition of a standard hot region: Each hot region is composed of at least three hot spot residues, each hot spot residue is assumed to be a regular sphere, and the Cα atom of each hot spot residue is considered to be the center of the sphere. Calculate the radius of the sphere from the volume of the hot spot residue sphere [24]. If the distance between the  centers of two spheres (two Cα-atoms of two hot spots) is less than the sum of the radius of the two spheres plus a tolerance distance (2 Å), the two hot spot residues are flagged to be clustered and to form a network in the hot region. The prediction accuracy of the prediction model for different amino acid mutations is compared with the standard hot regions. Finally, 47 standard hot regions were detected. For detailed results, see Additional file 2.
Because of a hot region contains at least three hot spot residues, and the maximum distance between two contacting amino acids in the same hot region is 9.5 Å, in the process of setting the parameters of "Min" and "ε" of DBSCAN algorithm, we set "Min" to be greater than 3 and set "ε" to be less than 9.5 Å. According to the results of all methods clustering, when "Min" is 3 and "ε" is 9.5 Å, the DBSCAN algorithm could achieves the optimal F-measures in other methods except the GNB. However, under this parameter setting, GNB recognizes more non-hot spots as hot spots. With "Min" set to 4 and "ε" set to 8.5 Å, the GNB could more accurately label non-hot spot as noises. The clustering algorithm will perform well when the factors that form hot regions are improved.

Methods
The hot spot residue that was predicted by different machine learning algorithm are clustered to form hot region with DBSCAN algorithm. Then, we evaluate the hot region prediction by comparing hot regions from predicted models with the standard hot regions in dataset.

Binding free energy changes
Data on single amino acids mutated into alanine was extracted from the SKEMPI 2.0 (https:// life. bsc. es/ pid/ skemp i2/) which contains affinities of wild type complexes and affinities of mutated complexes measured by biological experiments in the scientific literature. Because binding free energy changes of multiple mutated residues has not been accumulated based on such single mutated residues, more samples from multiple mutated samples cannot be deduced. The data with affinity could not be measured were deleted. We calculate the bind free energy of amino acid mutations according to Formula (1), where the binding affinity ( K d ) is determined according to biological experiments such as Surface Plasmon Resonance and Isothermal Titration Calorimetry [25,26], and R is the gas constant (8.314/4184) kcal/(K*mol) (1 kcal = 4.184 kJ) and T is the experimental temperature (in the range of 273 K to 323 K). The change of bind free energy ( G ) can be calculated based on the G mut and G wt , which be calculated from Eqs. (1), (2).

Feature selection
In this paper, the structural information on protein complexes is from PDB (Protein Data Bank) [27]. We extracted features including solvent accessible surface area, protrusion index, relative accessible surface area, binding sites and the depth index with aimed amino acid from PSAIA [28], and calculate RctASA, RctmPI by formula (3), (4), conservation scores from the ConSurf server [29], the attributes of amino acid side chains, the hydrophobic index, and the interaction numbers between two amino acids. The detailed features are shown in Additional file 1. The additional file indicated that the attribution of the amino acid side chain is a discrete variable, we encode the feature with one-hot, which is a feature extraction method that can deal with discontinuous numerical features. In total, we collected 83 features from protein structural, sequence and energy. Since not every feature contributes the same to the model, we determined the optimal feature subset by combining mRMR algorithm (minimum Redundancy Maximum Relevance) [30]; the mutual information I (x, y) is labeled as: and the maximum correlation criterion and the minimum redundancy criterion are defined as: According to the training set, a feature list is produced. To discover the highest F-score combination, we applied incremental feature selection and got a rank of all features in descending order. Every time we combined the feature ranked at the top with its next one to obtain the F-measure in machine learning models, then selected the set of features with the best F-measure.

Prediction of hot regions
There are non-hot spot residues and hot spot residues in the dataset. However, we needed to detect as many hot spots as possible in the hot regions which contain hot spot residues.

Naïve Bayes classifier
We constructed a Gaussian Naïve Bayes classifier given a set of training examples with class labels and then used the model to distinguish between non-hot spot residues and hot spot residues [31][32][33]. One example is a tuple of features (x 1 , x 2 , . . . , x n ) of one sample (x i ) and the class label c of the sample, so X is all samples and C is the classification variable. In our experiment, we assumed that there are two classes: c = 0 (non-hot spot residue) and c = 1 (hot spot residue). According to the Bayes Rule, the probability of one sample E = (x 1 , x 2 , . . . , x n ) , being class c of sample E is: Sample E is classified as class 1 if and only if f b (E) is more than 1, otherwise, sample E will be classified as class 0: Assuming that all features are independent given the value of the class variable, that is the resulting classifier is then: The function f nb (E) is called a Naïve Bayesian classifier, or simply Naïve Bayes (NB). When we used the Gaussian distribution to calculate the p(x i |C) , the classifier becomes Gaussian Naïve Bayes. Given probability distribution is under Gaussian distribution, the function is:

Support vector machine
A support vector machine constructs a hyper-plane or set of hyper-planes in a high or infinite-dimensional space, which can be used for classification, regression or other tasks. Intuitively, a good separation is achieved by the hyper-plane that has the largest distance to the nearest training data points of any class (so-called functional margin), since in general the larger the margin the lower the generalization error of the classifier. Given training dataset D = x 1 , y 1 , . . . , x n , y n the goal of the classification is to find a maximum-margin hyperplane w t x + b = 0.
Used as the output of SVM in binary classification: optimized objective function: where N is sample size, W is the output adjustable parameter vector of support vector machine, K x j , x i is kernel function. The objective function J is to ensure the optimality of classification, and the constraint condition is to ensure the correctness of classification. In order to eliminate the influence of noise and abnormal samples, relaxation variables are introduced as follows:

Xgboost
Xgboost (eXtreme Gradient Boosting) is one type of ensemble learning. The boosting method, by combining multiple weak learners, gives the final learning result. We used the theory of regression tasking to build the optimal Boosting model, regardless of the classification or regression.
The objective function consists of two parts, the first part is used to measure the difference between the predicted score and the real score, and the second part is the regularization term.
The newly generated tree is to fit the residual error predicted last time, that is, when t trees are generated, the prediction score can be written as: where t is the number of leaf nodes, and w is the fraction of leaf nodes. Gamma can control the number of leaf nodes; a lambda can control the number of leaf nodes so they do not get too large as this avoids overfitting.
According to (17) (18) the objective function: in Xgboost is used to approximate it using its Taylor second-order expansion at f t = 0 . Therefore, the objective function is approximate: where g i is a derivative, and h i is the second derivative. The objective function can be simplified as:

Random forest
The advantage of a random forest algorithm is that it combines several weak classifiers into one strong classifier, which can resist the over-fitting of the decision tree by using a voting mechanism. Random forest has a strong generalization ability and high efficiency and accuracy for multidimensional data classification. The Gini coefficient is defined as follows: where p k is the probability that the sample is in class k. The probability of misclassification is (1-p k ). The Gini was calculated for each feature in each sample, and then we selected the feature with the optimal Gini coefficient, theta θ * .
where θ i represents the feature of the i in the sample.

Artificial neural network
An artificial neural network (ANN) is a computational model based on the structure and functions of biological neural networks. Information that flows through the network affects the structure of the ANN because a neural network changes or learns, based on the input and output. We constructed networks containing five layers with activation function rectified linear units (ReLu) and the input layer size corresponds to the 83 features obtained from feature selection.

Evaluation
In this paper, we adopt the following general evaluation indicators to evaluate the performance of the prediction model for hot spots and hot region.
When predicting hot spots, the following notations are used: Similarly, Precision represents the accuracy of the hot region prediction, and Recall represents the coverage of predicted hot regions in standard hot regions. With a good balance between Precision and Recall, the F-measure offers a better overall accuracy in predicting hot regions than using either Precision or Recall solely.

Cluster
Cluster analysis is abbreviated as clustering, which is the process of dividing a data object into subsets. Each subset is a cluster. The clustering process makes the objects in the clusters similar to each other, but not similar to the objects in other clusters. Because the hot spot residues in protein-protein interactions are not evenly distributed on the interface of protein interactions, they are tightly gathered in a dense area. We use the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm to predict hot spots based on the characteristic that hot spot residues gather on the protein interaction interface in a flowing structure. The DBSCAN algorithm is suitable to cluster this kind of data. There are two hyperparameters "Min" and "ε" to be measured in this algorithm. "Min" represents the density of residue measured by the number of residues of it, and "ε" represents the radius of residue O as the center of a circle.
For a dataset D composed of residues with the parameters of "Min" and "ε", the residues with more than or equal to "Min" will be regarded as the core residue in its ε-neighborhood. After checking all residues, the core residues and their ε-neighborhood residues will make up the dense regions, which are the clusters we need.
The detailed process about clustering is that all residues should be defined as "unvisited" in the first stage. Then, we need to select randomly one residue p as the center of the circle and calculate the number of residues in its neighborhood to distinguish whether the residue is a core residue or not. If it is a core residue, we labelled the residue as "visited" and selected the neighborhood residues as the next detected objects. If existing core residues are in them, the process continues until the cluster C cannot be extended. Then we return to the beginning, select randomly one residue p in the remaining residues that was labeled as "unvisited" as the center of a circle, and repeat the process until all residues are "visited". Eventually, we will obtain several clusters.

Comparison of prediction results visualization
We assumed that a hot region is predicted correctly only when 60 percent of hot spot residues in the standard hot regions occur in the predicted hot region. Therefore, if the results predicted by GNB are clustered to form a hot region that are regarded as true positive hot regions, the evaluation of the results of other machine learning methods for recall will be reduced, and the increase of mistaken hot regions will lead to the evaluation of precision is decrease. We visualized the hot spot residues and predicted hot spot residues of the protein complex 3HQY with PyMol software [34] in Fig. 2. In the protein complex 3QHY, there are 16 hot spot residues in the standard hot region. The GNB algorithm can correctly predict 15 true positive hot spot residues in the standard hot region and only two non-hot spot residues come within the predicted hot region. In addition to ANN, other models have higher accuracy for non-hot spot residues, but cannot predict more hot spot residues in the standard hot region, so the recall is low. ANN can correctly predict 14 hot spot residues in the standard hot region, but a non-hot spot residue was incorrectly predicted as a hot spot residue.

Conclusion
In this paper, we collect alanine mutations data from the latest presented SKEMPI 2.0 database. When we use 1.0 kcal/mol as the threshold for hot spot and non-hot spot residues, it shows that amino acids of aromatic are more likely to become hot spots residues. Furthermore, hot spot residues are 70.4% from TRP. In the first stage, we used the mRMR algorithm to rank the importance of every feature based on mutual