ECMIS: computational approach for the identification of hotspots at protein-protein interfaces

Background Various methods have been developed to computationally predict hotspot residues at novel protein-protein interfaces. However, there are various challenges in obtaining accurate prediction. We have developed a novel method which uses different aspects of protein structure and sequence space at residue level to highlight interface residues crucial for the protein-protein complex formation. Results ECMIS (Energetic Conservation Mass Index and Spatial Clustering) algorithm was able to outperform existing hotspot identification methods. It was able to achieve around 80% accuracy with incredible increase in sensitivity and outperforms other existing methods. This method is even sensitive towards the hotspot residues contributing only small-scale hydrophobic interactions. Conclusion Combination of diverse features of the protein viz. energy contribution, extent of conservation, location and surrounding environment, along with optimized weightage for each feature, was the key for the success of the algorithm. The academic version of the algorithm is available at http://caps.ncbs.res.in/download/ECMIS/ECMIS.zip. Electronic supplementary material The online version of this article (doi:10.1186/1471-2105-15-303) contains supplementary material, which is available to authorized users.


Background
Protein-protein interactions are vital for many cellular processes like signal transduction, DNA replication, cellular motion, and transport of molecules from one cell to another. Free energy is an important criterion for proteinprotein binding and hence for better understanding of protein-protein interactions. The contribution of various interface residues towards free energy of binding is not uniform [1,2] and the ones which are energetically more important are known as hotspots. Hotspot residues are defined as those which bring changes in the binding free energy by more than 2 kcal/mol, when mutated to alanine [3]. These residues are generally seen to exist in clusters known as 'hot regions' [4]. Such hotspot regions provide stability to the protein complexes and also attribute specificity to their binding sites [1,2,5]. Alanine Scanning Energetic Database (ASEdb) [6] contains a list of hotspots from some selected proteins where they were mutated to alanine and changes in free energy of binding were recorded. Binding Interface Database (BID) [7] is another database which collects information on hotspot residues from literature studies.
The amino acid compositions of hotspot and nonhotspot residues are slightly different [2]. Residues like Tyr, Arg and Trp have higher tendency to be a hotspot residue, because of their size and conformation [2], while other residues like Leu, Thr, Ser and Val are less prevalent [2,5]. Asp and Asn have been observed to contribute critically to hotspots, more frequently than Glu and Gln. This might be attributed to the differences in their side chain conformational entropy [2,5]. Some studies have also indicated that hotspot residues are more conserved than non-hotspot residues [8,9]. Hotspot residues have been observed to be surrounded by residues which are moderately conserved [4] and play a part in occluding the bulk solvent from the hotspots [10,11]. This occlusion of hotspot residues from the bulk solvent is found to be the major reason for their highly efficient interactions with other residues at the interface. Residues at protein-protein interfaces have been studied for their conserved nature [12][13][14][15] and those residues which are structurally and functionally [16] important tend to remain evolutionarily conserved or mutate at a slower pace as compared to the rest of the protein. Further studies maintain that conserved residues remain highly buried in the protein surface [17,18]. Hotspot residues have been found to correlate well with the conserved residues at the interfaces [17,19] and found to be buried and tightly packed within the interface [4].
Since identification of hotspot residues by experimental methods like alanine scanning mutagenesis [20], alanine shaving [21] and residue grafting [21] is both expensive and time consuming, their characteristics have been greatly exploited by a number of computational methods which can predict and identify these hotspot residues from the interface ones. In recent years, several computational methods have been developed which uses one or more characteristics of hotspots, as described above, to identify and successfully predict them from the set of interface residues.
ROBETTA [22] uses a simple physical model which measures changes in binding energy of the complex when a residue is mutated to Alanine. It was applied on a large dataset obtained from ProTherm and ASEdb. FOLDEF [23] uses atomic descriptors of protein structures and various energy terms weighted based on empirical data as obtained from experimental data. It was trained on 339 mutations as obtained from 9 different proteins and the various parameters were optimised. This was then tested on 667 mutations from 82 protein-protein complexes.
There are several machine learning methods available e.g. KFC [24] uses a machine learning approach to characterize its local structural environment and then compare it with the environments of experimentally determined hotspots. If the environment of the interface residue resembles the experimentally determined hotspots, then it is predicted as a hotspot. The method was trained on 249 experimentally characterised mutations from 16 non-redundant protein-protein complexes and tested on an independent test dataset of 112 mutations. MINERVA [25] uses a support vector machine (SVM) based approach, wherein various structure, sequence and molecular interaction parameters are used to predict hotspots. HotPoint [26] is based on an empirical model which uses features like solvent occlusion and knowledge-based pair potential of residues to predict hotspots. KFC2 [27] uses a SVM-based approach, wherein solvent accessibility and local plasticity of the residues are used as features to predict hotspots. Most of these methods are trained on a subset of Alanine Scanning Energetic Database (ASEdb) and tested independently on a dataset obtained from Binding Interface Database (BID).
Other methods use features like solvent accessibility [28][29][30], atomic contacts [31], restricted mobility [17], location in the interaction patch [4], structural conservation [32], sequence conservation [29,[33][34][35], sequence environment and evolutionary profile [36], and pattern mining [37] to identify hotspot residues. Although these methods alone provide reasonable information about the hotspot residues, it has been observed that these cannot be used for the prediction/identification of hotspot residues with high accuracy [38]. Some of the methods employ energy functions [23,39] while others use molecular dynamics simulations [40]. Various machine learning approaches [3,24,25,[41][42][43], based on geometry and biochemical features of residue-residue contacts across binding interfaces, have also been developed to identify hotspot residues. Simple empirical method based on residue-residue pairwise potentials and surface accessibility [26], and a different method which uses protein docking tools [44], have also been developed which identifies hotspot residues with fairly good accuracy. Robetta [22] was one of the first methods developed to identify hotspot residues, which accounted for energies of packing interactions, hydrogen bonds and solvation [45]. Molecular dynamics (MD) simulations have also been used and found to provide good predictive results for hotspot prediction [46]. However, MD simulations cannot be used for large scale prediction of hotspot residues, since they are computationally very intensive.
In this paper, we present a new method "Energetic Conservation Mass Index and Spatial Clustering" (ECMIS) which uses a combination of interface energetic (noncovalent interactions like hydrogen bonds, Van der Waals and electrostatics), residue conservation, massindex and spatial clustering to predict hotspot residues with higher accuracy than any of the other methods available. ECMIS considers most essential and carefully selected distinguishing features of hotspot residues, along with optimum weightage, to calculate combined score for each position. Hence, ECMIS was able to achieve high sensitivity compared to other methods.

Method
Dataset a) Training set A dataset of 316 alanine-mutated interface residues (Additional file 1) derived from 19 protein complexes was taken from ASEdb [6]. Residues in the dataset corresponding to a binding free energy equal to or higher than 2.0 kcal/mol were alone considered as a hotspot residues. The interface residues with binding free energy less than 0.4 kcal/mol were considered as non-hotspot residues, as described by Tuncbag et al. [30] and Xia et al. [47]. Other interface residues with binding free energy between 0.4 and 2.0 kcal/mol were excluded from the training set, in order to better discriminate between hotspots and non-hotspots. The final training dataset comprised of 78 hotspot residues and 119 non-hotspot residues. The program was optimized based on the prediction accuracy of the hotspots in this dataset with varying parameters. The entry 1DN2 has been removed from the dataset, since the protein is complexed with an artificial peptide and therefore the conservation based scores cannot be applied. b) Test set An independent test set from the BID database [7] (Additional file 2) was used to further assess the performance of our proposed method. The residues in BID database, are categorized as 'strong', 'intermediate', 'weak' or 'insignificant' based on the effect of the mutation. The residues labeled as 'strong' were considered as true hotspot and the other residues are considered as non-hotspots. As a result, the test set contained 125 alanine-mutated interface residues in 18 protein complexes with 38 hotspots and 87 non-hotspots.

Dataset for calculation of energy ranges
PPCheck is a program used for calculating energies at protein-protein interfaces and the energy ranges have been benchmarked earlier on 246 complexes (Sukhwal and Sowdhamini, 2013) [48]. These PDB complexes were obtained at a resolution of 2.5 Å or better, constituting 270 protein-protein interfaces (water excluded from interface) in order to define the energy ranges for the three energy components viz. electrostatic-energy, Van der Waals interaction energy and hydrogen bond energy. This benchmarking dataset had included homodimers, heterodimers, transient and permanent complexes, antigen-antibody complexes, etc. [48].

Energy scoring scheme
The energy contribution per residue was examined, as reported in PPCheck. Energy values from PPCheck involve three energy components viz. electrostatics, Van der Waals interactions and hydrogen bond energy. Further scripting was done to extract energy values in a residue-centric manner. Energy component for each residue was weighted to calculate final energy score. These weights were decided based on the application and performance on training dataset (Additional files 3 and 4).
Where e i T = Total binding energy contributed by i th residue e i vw = Van der Waals interaction energy contributed by i th residue e i ES = Electrostatic interaction energy contributed by i th residue e i HB = Hydrogen bond energy contributed by i th residue w VW = Optimized weight for Van der Waals interaction energy w ES = Optimized weight for electrostatic interaction energy w HB = Optimized weight for hydrogen bond energy Energy per residue was then normalized with respect to the volume of the residue to reduce bias due to size of the interacting residues.
Where ne i T = Volume normalized total interaction energy contributed by i th residue V i = Volume of i th residue These volume-normalized scores were further normalized using observed energy ranges of all component energies.
Where E i = Final binding energy of i th residue normalized between 0-1 ne max = Maximum volume normalized interaction energy observed in large dataset of different protein complexes

Conservation score
Along with energy score, the extent of evolutionary conservation for each residue was calculated. First, homologues were searched using PSI-BLAST [49] tool and homologues having blast identity more than 30% were chosen. Further, redundancy amongst homologous sequences was addressed by applying a filter at 80% sequence identity by using CD-HIT [50]. The threshold of 80% was found as the best value to remove highly similar sequences as well as maintaining optimum number of homologues required for accurate multiple sequence alignment (Additional file 5). This was performed to calculate the conservation score without any bias due to closely related sequences. All the homologues, along with query, were aligned using ClustalW [51] software. Each position was then individually checked for conservation and assigned a conservation score as per Johnson and Overington matrix [52]. This matrix was derived using structure-based sequence alignment of homologous protein families. A similar approach was used earlier in Smotif [53] algorithm, which proved to be very efficient in finding structural motifs.
Where c i = Normalized total conservation score of i th position a = Residue type present in homologous sequence at i th position in multiple sequence alignment b = Residue type present in another homologous sequence at i th position in multiple sequence alignment S ab = Amino acid substitution score residue type "a" substituted by residue type "b" from Birkbeck matrix n = Total number of homologues present in the multiple sequence alignment All scores were further normalized by 100 (maximum possible score i.e. cysteine-cysteine substitution score in Johnson and Overington matrix [54]).

Mass index score
For each interface residue, sum of mass of interacting residues were calculated as Mass Index score.
Where mi i = mass index of i th residue m i = mass of i th residue m j = mass of j th residue j = j th residue interacting with i th residue n = Total number of residues interacting with i th residue mi max = Maximum mass index in large dataset of different protein complexes MI i = mass index of i th residue normalized between 0-1

Spatial clustering
Hotspot residues could cluster spatially and forms hot-regions [4]. This fact was used to further enhance score of those hotspot residues which forms very efficient and conserved binding patch. To achieve this, average of energy and conservation scores was referred. If this average score for any residue exceeds 0.5, then its score will be further enhanced with respect number of other hotspot residues within 7 Å spatial proximity.
Where sc i = Spatial cluster score for i th residue n i intra = Number of residues present within same protomer within 7 Å distance of i th residue n i inter = Number of residues present within interacting protomer within 7 Å distance of i th residue sc max = Maximum spatial cluster score observed in large dataset of different protein complexes SC i = Final spatial cluster score for i th residue normalized between 0-1

Final score
Final score was calculated by combining energy score, conservation score and spatial clustering score. Each subscore was weighted according to their importance in identifying hotspots. These weights were applied along with threshold score (decided using ROC plots) to decide hotspot criteria and were empirically optimized based on the minimization of residual error in the prediction using training dataset. Here, the reduction in the number of false positives and false negatives were considered as optimization function.
Where f i = Final combined score of i th residue w E = Optimized weight for energy score w C = Optimized weight for conservation score w SC = Optimized weight for spatial clustering score w MI = Optimized weight for mass index f max = Maximum combined score observed in data F i = Final combined score of i th residue normalized between 0-1

Performance evaluation
In order to assess the performance of classification methods, commonly used measures such as prediction accuracy (ACC), sensitivity (SE), precision (PR), specificity (SP) and Mathews Correlation Coefficient (MCC) were used. These measurements are defined as Where TP, FP, TN and FN represent true positive (correctly predicted hotspot residue), false positive (nonhotspot residue incorrectly predicted as hotspot), true negative (correctly predicted non-hotspot residue) and false negative (hotspot residue incorrectly predicted as non-hotspot), respectively.

Results and discussions
Optimization of the parameters of ECMIS All the parameters viz. weights for each type of component energy (vizN electrostatic energy, Van der Waals energy and hydrogen bonding), weights for each type of score and threshold hotspot score were optimized empirically. The maximum accuracy was achieved using optimized values of these parameters, mentioned in Table 1).
The best value for the discriminative threshold scores ( Table 2) for all scoring schemes were decided after consulting respective ROC plots (Figure 1). ROC -Receiver operating characteristics plot is one of the methods which can be used to decide a threshold value for the given parameter at which an optimum performance for the algorithm can be achieved. ROC curve graphically represents gain in true positive rate with the expense of false positive rate. The point after which increase in true positive rate is smaller compared to increase in false positive rate selected as a threshold value shown in red ( Figure 1; Additional file 6).
While optimizing weights for individual component energy, it was observed that hydrogen-bond energy was always over-represented in case of threonine and serine. Hence their mass-index values (MI Ser > 0.5, MI Thr > 0.5) was considered as additional criteria to reduce false positive in case of serine or threonine residues. In contrast tryptophane and phenylalanine mostly contributed in Van der Waals energy which is further normalized by their volume. Compared to other types of energies, magnitude of Van der Waals energy was very small while volume of phenylalanine and tryptophan was high compared to other amino acids. Therefore these residues always get smaller score irrespective of their importance in protein-protein complex formation. To overcome this problem again mass index score (MI Trp > 0.5, MI Phe > 0.5) were consulted to improve scores of true positive tryptophan and phenylalanine residues.

Normalization of scores
In order to compare scores obtained for one protein complex with another protein complex all scores were normalized using maximum value observed for each parameter in dataset of diverse protein complexes (Figure 2). These ranges were decided after considering 95% of the data and extreme 5% were ignored.

Prediction of the independent test set
The optimized parameters were used for the identification of hotspot residues in the independent test dataset from the BID database. Our algorithm was able to achieve an accuracy of approximately 80% on the test dataset for optimized set of weights ( Table 1).

Comparison of the method with other methods
ECMIS was compared with various other methods available for the identification of hotspot residues Robetta [22] and FOLDEF [23], decision tree methods such as KFC [24] and three recently published methods MINERVA [25], HotPoint [26], KFC2 [27] and random forest based methods [43]. An independent test set from the BID database with 125 alanine-mutated interface residues in 18 protein complexes with 38 hotspots and 87 non-hotspots was used. The performance of these methods on the test set is listed in Table 3. ECMIS performs better than the currently available methods with an accuracy of 80% and MCC of 0.524.    Case studies a) Colicin endonuclease-Im9 complex A random protein from the PDB was chosen to demonstrate the performance of our hotspot identification algorithm. Colicin endonucleases (DNases) are bound and inactivated by immunity (Im) proteins. A number of hotspot residues have been identified by mutagenesis which affects the binding of the DNase-1 m9 complex ((PDBID: 2VLQ). It has been shown that the mutation of three Im9 residues of helix III Asp51, Tyr54 and Tyr55 to alanine generates change in the energy values of ΔΔG > 5 kcal/mol) [54]. In the case of E9 DNase three important residues (Asn75, Phe86 and Lys97) form a central belt on the surface of the enzyme that comprises the hotspot. Additionally the salt-bridge between Glu41 of Im9 with Lys97 of the E9 DNase has been shown to be a specificity contact in this complex [55]. Arg54 and Asn72 of E9 DNase have also been found to effect the binding of DNase to Im9 protein. It was observed that ECMIS was able to pick 2 out of the 5 hotspot residues in Dnase (Lys97 and Asn72) and 3 out of the 4 hotspots in the 1 m9 protein (Figure 3). Since this complex is F86A mutant of the Dnase-1 m9 complex the Phe86 of DNase and its interacting partner Tyr55 of 1 m9 were not picked up by our program. b) Subtilisin BPN' -Chymotrypsin inhibitor 2 Chymotrypsin inhibitor 2 (CI2) inhibits the serine protease subtilisin by binding to its active site (PDBID: ITM1). A series of mutants have been found to affect the binding of chymotrypsin inhibitor to subtilisin. It has been shown that the network of hydrogen bonds and electrostatic interactions connecting the CI2 binding loop to the  protein core provides structural integrity and conformational stability relevant both for binding affinity and for control of inhibitor religation. The H-bond between Thr-58 and Glu-60 bridges the cleavage site, while the interactions between Gly-83, Arg-65, and Glu-60 tie the leaving group R'-peptide tightly to the protein core, assisting in leaving group retention and accelerating the religation reaction. It has also been shown that mutation of Arg-62, a peripheral participant in the H-bonding network, has comparatively little effect on hydrolysis or inhibition, while mutation of Arg-67 has an intermediate effect [56]. Similarly the importance of Met-59 and Tyr-61 has been described in [57]. Among the above described hotspot residues our program was able to predict five out of eight reported residues (Figure 4).

Conclusion
Protein-protein interaction hotspot refers to a residue or cluster of residues that makes a major contribution to the binding free energy of protein-protein complexes, as determined by alanine scanning mutagenesis. These residues serve as important targets in the field of pharmaceutical industry for the impedance of certain protein-protein complexes. A number of recent studies have been successful in developing (drug-like) small molecules that bind at hotspots and inhibit complex formation. Experimental identification of hotspot residues is however expensive and time-consuming, and computational methods can thus be helpful in suggesting residues for possible experimentation. In this paper, we describe a novel algorithm which performs better than the existing methods for the identification of hotspot residues validated using previously established experimental data. The method records the highest accuracy available so far for the prediction of hotspots at protein-protein interaction sites.

Additional files
Additional file 1: Hotspot and non-hotspot residues from the Alanine scanning database used as the training set.
Additional file 2: Hotspot and non-hotspot residues from the BID database used as the independent test set.
Additional file 3: Weight ranges and their respective increments used during optimization process.
Additional file 4: Details on optimization of the weights.
Additional file 5: Availability of homologues at different sequence identity threshold for some PDB entries.