ECMIS: computational approach for the identification of hotspots at protein-protein interfaces
BMC Bioinformatics volume 15, Article number: 303 (2014)
Various methods have been developed to computationally predict hotspot residues at novel protein-protein interfaces. However, there are various challenges in obtaining accurate prediction. We have developed a novel method which uses different aspects of protein structure and sequence space at residue level to highlight interface residues crucial for the protein-protein complex formation.
ECMIS (Energetic Conservation Mass Index and Spatial Clustering) algorithm was able to outperform existing hotspot identification methods. It was able to achieve around 80% accuracy with incredible increase in sensitivity and outperforms other existing methods. This method is even sensitive towards the hotspot residues contributing only small-scale hydrophobic interactions.
Combination of diverse features of the protein viz. energy contribution, extent of conservation, location and surrounding environment, along with optimized weightage for each feature, was the key for the success of the algorithm. The academic version of the algorithm is available at http://caps.ncbs.res.in/download/ECMIS/ECMIS.zip.
Protein-protein interactions are vital for many cellular processes like signal transduction, DNA replication, cellular motion, and transport of molecules from one cell to another. Free energy is an important criterion for protein-protein binding and hence for better understanding of protein-protein interactions. The contribution of various interface residues towards free energy of binding is not uniform [1, 2] and the ones which are energetically more important are known as hotspots. Hotspot residues are defined as those which bring changes in the binding free energy by more than 2 kcal/mol, when mutated to alanine . These residues are generally seen to exist in clusters known as ‘hot regions’ . Such hotspot regions provide stability to the protein complexes and also attribute specificity to their binding sites [1, 2, 5]. Alanine Scanning Energetic Database (ASEdb)  contains a list of hotspots from some selected proteins where they were mutated to alanine and changes in free energy of binding were recorded. Binding Interface Database (BID)  is another database which collects information on hotspot residues from literature studies.
The amino acid compositions of hotspot and non-hotspot residues are slightly different . Residues like Tyr, Arg and Trp have higher tendency to be a hotspot residue, because of their size and conformation , while other residues like Leu, Thr, Ser and Val are less prevalent [2, 5]. Asp and Asn have been observed to contribute critically to hotspots, more frequently than Glu and Gln. This might be attributed to the differences in their side chain conformational entropy [2, 5]. Some studies have also indicated that hotspot residues are more conserved than non-hotspot residues [8, 9]. Hotspot residues have been observed to be surrounded by residues which are moderately conserved  and play a part in occluding the bulk solvent from the hotspots [10, 11]. This occlusion of hotspot residues from the bulk solvent is found to be the major reason for their highly efficient interactions with other residues at the interface. Residues at protein-protein interfaces have been studied for their conserved nature [12–15] and those residues which are structurally and functionally  important tend to remain evolutionarily conserved or mutate at a slower pace as compared to the rest of the protein. Further studies maintain that conserved residues remain highly buried in the protein surface [17, 18]. Hotspot residues have been found to correlate well with the conserved residues at the interfaces [17, 19] and found to be buried and tightly packed within the interface .
Since identification of hotspot residues by experimental methods like alanine scanning mutagenesis , alanine shaving  and residue grafting  is both expensive and time consuming, their characteristics have been greatly exploited by a number of computational methods which can predict and identify these hotspot residues from the interface ones. In recent years, several computational methods have been developed which uses one or more characteristics of hotspots, as described above, to identify and successfully predict them from the set of interface residues.
ROBETTA  uses a simple physical model which measures changes in binding energy of the complex when a residue is mutated to Alanine. It was applied on a large dataset obtained from ProTherm and ASEdb. FOLDEF  uses atomic descriptors of protein structures and various energy terms weighted based on empirical data as obtained from experimental data. It was trained on 339 mutations as obtained from 9 different proteins and the various parameters were optimised. This was then tested on 667 mutations from 82 protein-protein complexes.
There are several machine learning methods available e.g. KFC  uses a machine learning approach to characterize its local structural environment and then compare it with the environments of experimentally determined hotspots. If the environment of the interface residue resembles the experimentally determined hotspots, then it is predicted as a hotspot. The method was trained on 249 experimentally characterised mutations from 16 non-redundant protein-protein complexes and tested on an independent test dataset of 112 mutations. MINERVA  uses a support vector machine (SVM) based approach, wherein various structure, sequence and molecular interaction parameters are used to predict hotspots. HotPoint  is based on an empirical model which uses features like solvent occlusion and knowledge-based pair potential of residues to predict hotspots. KFC2  uses a SVM-based approach, wherein solvent accessibility and local plasticity of the residues are used as features to predict hotspots. Most of these methods are trained on a subset of Alanine Scanning Energetic Database (ASEdb) and tested independently on a dataset obtained from Binding Interface Database (BID).
Other methods use features like solvent accessibility [28–30], atomic contacts , restricted mobility , location in the interaction patch , structural conservation , sequence conservation [29, 33–35], sequence environment and evolutionary profile , and pattern mining  to identify hotspot residues. Although these methods alone provide reasonable information about the hotspot residues, it has been observed that these cannot be used for the prediction/identification of hotspot residues with high accuracy . Some of the methods employ energy functions [23, 39] while others use molecular dynamics simulations . Various machine learning approaches [3, 24, 25, 41–43], based on geometry and biochemical features of residue-residue contacts across binding interfaces, have also been developed to identify hotspot residues. Simple empirical method based on residue-residue pairwise potentials and surface accessibility , and a different method which uses protein docking tools , have also been developed which identifies hotspot residues with fairly good accuracy. Robetta  was one of the first methods developed to identify hotspot residues, which accounted for energies of packing interactions, hydrogen bonds and solvation . Molecular dynamics (MD) simulations have also been used and found to provide good predictive results for hotspot prediction . However, MD simulations cannot be used for large scale prediction of hotspot residues, since they are computationally very intensive.
In this paper, we present a new method “Energetic Conservation Mass Index and Spatial Clustering” (ECMIS) which uses a combination of interface energetic (non-covalent interactions like hydrogen bonds, Van der Waals and electrostatics), residue conservation, mass-index and spatial clustering to predict hotspot residues with higher accuracy than any of the other methods available. ECMIS considers most essential and carefully selected distinguishing features of hotspot residues, along with optimum weightage, to calculate combined score for each position. Hence, ECMIS was able to achieve high sensitivity compared to other methods.
A dataset of 316 alanine-mutated interface residues (Additional file 1) derived from 19 protein complexes was taken from ASEdb . Residues in the dataset corresponding to a binding free energy equal to or higher than 2.0 kcal/mol were alone considered as a hotspot residues. The interface residues with binding free energy less than 0.4 kcal/mol were considered as non-hotspot residues, as described by Tuncbag et al.  and Xia et al. . Other interface residues with binding free energy between 0.4 and 2.0 kcal/mol were excluded from the training set, in order to better discriminate between hotspots and non-hotspots. The final training dataset comprised of 78 hotspot residues and 119 non-hotspot residues. The program was optimized based on the prediction accuracy of the hotspots in this dataset with varying parameters. The entry 1DN2 has been removed from the dataset, since the protein is complexed with an artificial peptide and therefore the conservation based scores cannot be applied.
An independent test set from the BID database  (Additional file 2) was used to further assess the performance of our proposed method. The residues in BID database, are categorized as ‘strong’, ‘intermediate’, ‘weak’ or ‘insignificant’ based on the effect of the mutation. The residues labeled as ‘strong’ were considered as true hotspot and the other residues are considered as non-hotspots. As a result, the test set contained 125 alanine-mutated interface residues in 18 protein complexes with 38 hotspots and 87 non-hotspots.
Dataset for calculation of energy ranges
PPCheck is a program used for calculating energies at protein-protein interfaces and the energy ranges have been benchmarked earlier on 246 complexes (Sukhwal and Sowdhamini, 2013) . These PDB complexes were obtained at a resolution of 2.5 Å or better, constituting 270 protein-protein interfaces (water excluded from interface) in order to define the energy ranges for the three energy components viz. electrostatic-energy, Van der Waals interaction energy and hydrogen bond energy. This benchmarking dataset had included homodimers, heterodimers, transient and permanent complexes, antigen-antibody complexes, etc..
Energy scoring scheme
The energy contribution per residue was examined, as reported in PPCheck. Energy values from PPCheck involve three energy components viz. electrostatics, Van der Waals interactions and hydrogen bond energy. Further scripting was done to extract energy values in a residue-centric manner. Energy component for each residue was weighted to calculate final energy score. These weights were decided based on the application and performance on training dataset (Additional files 3 and 4).
Where = Total binding energy contributed by ith residue
= Van der Waals interaction energy contributed by ith residue
= Electrostatic interaction energy contributed by ith residue
= Hydrogen bond energy contributed by ith residue
w VW = Optimized weight for Van der Waals interaction energy
w ES = Optimized weight for electrostatic interaction energy
w HB = Optimized weight for hydrogen bond energy
Energy per residue was then normalized with respect to the volume of the residue to reduce bias due to size of the interacting residues.
Where = Volume normalized total interaction energy contributed by ith residue
V i = Volume of ith residue
These volume-normalized scores were further normalized using observed energy ranges of all component energies.
Where Ei = Final binding energy of ith residue normalized between 0–1
nemax = Maximum volume normalized interaction energy observed in large dataset of different protein complexes
Along with energy score, the extent of evolutionary conservation for each residue was calculated. First, homologues were searched using PSI-BLAST  tool and homologues having blast identity more than 30% were chosen. Further, redundancy amongst homologous sequences was addressed by applying a filter at 80% sequence identity by using CD-HIT . The threshold of 80% was found as the best value to remove highly similar sequences as well as maintaining optimum number of homologues required for accurate multiple sequence alignment (Additional file 5). This was performed to calculate the conservation score without any bias due to closely related sequences. All the homologues, along with query, were aligned using ClustalW  software. Each position was then individually checked for conservation and assigned a conservation score as per Johnson and Overington matrix . This matrix was derived using structure-based sequence alignment of homologous protein families. A similar approach was used earlier in Smotif  algorithm, which proved to be very efficient in finding structural motifs.
Where ci = Normalized total conservation score of ith position
a = Residue type present in homologous sequence at ith position in multiple sequence alignment
b = Residue type present in another homologous sequence at ith position in multiple sequence alignment
Sab = Amino acid substitution score residue type “a” substituted by residue type “b” from Birkbeck matrix
n = Total number of homologues present in the multiple sequence alignment
All scores were further normalized by 100 (maximum possible score i.e. cysteine-cysteine substitution score in Johnson and Overington matrix ).
Ci= Final conservation score normalized between 0–1
Mass index score
For each interface residue, sum of mass of interacting residues were calculated as Mass Index score.
Where mii = mass index of ith residue
mi = mass of ith residue
mj = mass of jth residue
j = jth residue interacting with ith residue
n = Total number of residues interacting with ith residue
mimax = Maximum mass index in large dataset of different protein complexes
MIi = mass index of ith residue normalized between 0–1
Hotspot residues could cluster spatially and forms hot-regions . This fact was used to further enhance score of those hotspot residues which forms very efficient and conserved binding patch. To achieve this, average of energy and conservation scores was referred. If this average score for any residue exceeds 0.5, then its score will be further enhanced with respect number of other hotspot residues within 7 Å spatial proximity.
Where sci = Spatial cluster score for ith residue
= Number of residues present within same protomer within 7 Å distance of ith residue
= Number of residues present within interacting protomer within 7 Å distance of ith residue
scmax = Maximum spatial cluster score observed in large dataset of different protein complexes
SCi= Final spatial cluster score for ith residue normalized between 0–1
Final score was calculated by combining energy score, conservation score and spatial clustering score. Each subscore was weighted according to their importance in identifying hotspots. These weights were applied along with threshold score (decided using ROC plots) to decide hotspot criteria and were empirically optimized based on the minimization of residual error in the prediction using training dataset. Here, the reduction in the number of false positives and false negatives were considered as optimization function.
Where fi = Final combined score of ith residue
w E = Optimized weight for energy score
w C = Optimized weight for conservation score
w SC = Optimized weight for spatial clustering score
w MI = Optimized weight for mass index
fmax = Maximum combined score observed in data
Fi= Final combined score of ith residue normalized between 0–1
In order to assess the performance of classification methods, commonly used measures such as prediction accuracy (ACC), sensitivity (SE), precision (PR), specificity (SP) and Mathews Correlation Coefficient (MCC) were used. These measurements are defined as
Where TP, FP, TN and FN represent true positive (correctly predicted hotspot residue), false positive (non-hotspot residue incorrectly predicted as hotspot), true negative (correctly predicted non-hotspot residue) and false negative (hotspot residue incorrectly predicted as non-hotspot), respectively.
Results and discussions
Optimization of the parameters of ECMIS
All the parameters viz. weights for each type of component energy (vizN electrostatic energy, Van der Waals energy and hydrogen bonding), weights for each type of score and threshold hotspot score were optimized empirically. The maximum accuracy was achieved using optimized values of these parameters, mentioned in Table 1).
The best value for the discriminative threshold scores (Table 2) for all scoring schemes were decided after consulting respective ROC plots (Figure 1). ROC – Receiver operating characteristics plot is one of the methods which can be used to decide a threshold value for the given parameter at which an optimum performance for the algorithm can be achieved. ROC curve graphically represents gain in true positive rate with the expense of false positive rate. The point after which increase in true positive rate is smaller compared to increase in false positive rate selected as a threshold value shown in red (Figure 1; Additional file 6).
While optimizing weights for individual component energy, it was observed that hydrogen-bond energy was always over-represented in case of threonine and serine. Hence their mass-index values (MISer > 0.5, MIThr > 0.5) was considered as additional criteria to reduce false positive in case of serine or threonine residues. In contrast tryptophane and phenylalanine mostly contributed in Van der Waals energy which is further normalized by their volume. Compared to other types of energies, magnitude of Van der Waals energy was very small while volume of phenylalanine and tryptophan was high compared to other amino acids. Therefore these residues always get smaller score irrespective of their importance in protein-protein complex formation. To overcome this problem again mass index score (MITrp > 0.5, MIPhe > 0.5) were consulted to improve scores of true positive tryptophan and phenylalanine residues.
Normalization of scores
In order to compare scores obtained for one protein complex with another protein complex all scores were normalized using maximum value observed for each parameter in dataset of diverse protein complexes (Figure 2). These ranges were decided after considering 95% of the data and extreme 5% were ignored.
Prediction of the independent test set
The optimized parameters were used for the identification of hotspot residues in the independent test dataset from the BID database. Our algorithm was able to achieve an accuracy of approximately 80% on the test dataset for optimized set of weights (Table 1).
Comparison of the method with other methods
ECMIS was compared with various other methods available for the identification of hotspot residues Robetta  and FOLDEF , decision tree methods such as KFC  and three recently published methods MINERVA , HotPoint , KFC2  and random forest based methods . An independent test set from the BID database with 125 alanine-mutated interface residues in 18 protein complexes with 38 hotspots and 87 non-hotspots was used. The performance of these methods on the test set is listed in Table 3. ECMIS performs better than the currently available methods with an accuracy of 80% and MCC of 0.524.
Colicin endonuclease-Im9 complex
A random protein from the PDB was chosen to demonstrate the performance of our hotspot identification algorithm. Colicin endonucleases (DNases) are bound and inactivated by immunity (Im) proteins. A number of hotspot residues have been identified by mutagenesis which affects the binding of the DNase-1 m9 complex ((PDBID: 2VLQ). It has been shown that the mutation of three Im9 residues of helix III Asp51, Tyr54 and Tyr55 to alanine generates change in the energy values of ΔΔG > 5 kcal/mol) . In the case of E9 DNase three important residues (Asn75, Phe86 and Lys97) form a central belt on the surface of the enzyme that comprises the hotspot. Additionally the salt-bridge between Glu41 of Im9 with Lys97 of the E9 DNase has been shown to be a specificity contact in this complex . Arg54 and Asn72 of E9 DNase have also been found to effect the binding of DNase to Im9 protein. It was observed that ECMIS was able to pick 2 out of the 5 hotspot residues in Dnase (Lys97 and Asn72) and 3 out of the 4 hotspots in the 1 m9 protein (Figure 3). Since this complex is F86A mutant of the Dnase-1 m9 complex the Phe86 of DNase and its interacting partner Tyr55 of 1 m9 were not picked up by our program.
Subtilisin BPN’ – Chymotrypsin inhibitor 2
Chymotrypsin inhibitor 2 (CI2) inhibits the serine protease subtilisin by binding to its active site (PDBID: ITM1). A series of mutants have been found to affect the binding of chymotrypsin inhibitor to subtilisin. It has been shown that the network of hydrogen bonds and electrostatic interactions connecting the CI2 binding loop to the protein core provides structural integrity and conformational stability relevant both for binding affinity and for control of inhibitor religation. The H-bond between Thr-58 and Glu-60 bridges the cleavage site, while the interactions between Gly-83, Arg-65, and Glu-60 tie the leaving group R‘-peptide tightly to the protein core, assisting in leaving group retention and accelerating the religation reaction. It has also been shown that mutation of Arg-62, a peripheral participant in the H-bonding network, has comparatively little effect on hydrolysis or inhibition, while mutation of Arg-67 has an intermediate effect . Similarly the importance of Met-59 and Tyr-61 has been described in . Among the above described hotspot residues our program was able to predict five out of eight reported residues (Figure 4).
Protein-protein interaction hotspot refers to a residue or cluster of residues that makes a major contribution to the binding free energy of protein-protein complexes, as determined by alanine scanning mutagenesis. These residues serve as important targets in the field of pharmaceutical industry for the impedance of certain protein-protein complexes. A number of recent studies have been successful in developing (drug-like) small molecules that bind at hotspots and inhibit complex formation. Experimental identification of hotspot residues is however expensive and time-consuming, and computational methods can thus be helpful in suggesting residues for possible experimentation. In this paper, we describe a novel algorithm which performs better than the existing methods for the identification of hotspot residues validated using previously established experimental data. The method records the highest accuracy available so far for the prediction of hotspots at protein-protein interaction sites.
Energetic, conservation, mass index and spatial clustering
Receiver operating characteristics plot
Alanine scanning energetic database
Binding interface database
Clackson T, Wells JA: A hot spot of binding energy in a hormone-receptor interface. Science. 1995, 1995 (267): 383-386.
Bogan AA, Thorn KS: Anatomy of hot spots in protein interfaces. J Mol Biol. 1998, 1998 (280): 1-9.
Lise S, Archambeau C, Pontil M, Jones DT: Prediction of hot spot residues at protein–protein interfaces by combining machine learning and energy-based methods. BMC Bioinformatics. 2009, 10: 365-10.1186/1471-2105-10-365.
Keskin O, Ma B, Nussinov R: Hot regions in protein–protein interactions: the organization and contribution of structurally conserved hot spot residues. J Mol Biol. 2005, 345: 1281-1294. 10.1016/j.jmb.2004.10.077.
Moreira IS, Fernandes PA, Ramos MJ: Hot spots—a review of the protein–protein interface determinant amino-acid residues. Proteins. 2007, 68: 803-812. 10.1002/prot.21396.
Thorn KS, Bogan AA: ASEdb: a database of alanine mutations and their effects on the free energy of binding in protein interactions. Bioinformatics. 2001, 17: 284-285. 10.1093/bioinformatics/17.3.284.
Fischer TB, Arunachalam KV, Bailey D, Mangual V, Bakhru S, Russo R, Huang D, Paczkowski M, Lalchandani V, Ramachandra C, Ellison B, Galer S, Shapley J, Fuentes E, Tsai J: The binding interface database (BID): a compilation of amino acid hot spots in protein interfaces. Bioinformatics. 2003, 19: 1453-1454. 10.1093/bioinformatics/btg163.
Burgoyne N, Jackson R: Predicting protein interaction sites: binding hotspots in protein-protein and protein-ligand interfaces. Bioinformatics. 2006, 22 (11): 1335-1342. 10.1093/bioinformatics/btl079.
Guharoy M, Chakrabarti P: Conservation and relative importance of residues across protein-protein interfaces. Proc Natl Acad Sci. 2005, 102 (43): 15447-15452. 10.1073/pnas.0505425102.
Li J, Liu Q: ‘Double water exclusion’: a hypothesis refining the O-ring theory for the hot spots at protein interfaces. Bioinformatics. 2009, 25 (6): 743-750. 10.1093/bioinformatics/btp058.
Liu Q, Li J: Propensity vectors of low-ASA residue pairs in the distinction of protein interactions. Proteins. 2010, 78 (3): 589-602.
Grishin NV, Phillips MA: The subunit interfaces of oligomeric enzymes are conserved to a similar extent to the overall protein sequences. Protein Sci. 1994, 3: 2455-2458. 10.1002/pro.5560031231.
Valdar WS, Thornton JM: Protein-protein interfaces: analysis of amino acid conservation in homodimers. Proteins. 2001, 42: 108-124. 10.1002/1097-0134(20010101)42:1<108::AID-PROT110>3.0.CO;2-O.
Fraser HB, Hirsh AE, Steinmetz LM, Scharfe C, Feldman MW: Evolutionary rate in the protein interaction network. Science. 2002, 296: 750-752. 10.1126/science.1068696.
Caffrey DR, Somaroo S, Hughes JD, Mintseris J, Huang HS: Are protein-protein interfaces more conserved in sequence than the rest of the protein surface?. Protein Sci. 2004, 13: 190-202. 10.1110/ps.03323604.
Panchenko AR, Kondrashov F, Bryant S: Prediction of functional sites by analysis of sequence and structure conservation. Protein Sci. 2004, 13: 884-892. 10.1110/ps.03465504.
Yogurtcu ON, Erdemli SB, Nussinov R, Turkay M, Keskin O: Restricted mobility of conserved residues in protein-protein interfaces in molecular simulations. Biophys J. 2008, 94: 3475-3485. 10.1529/biophysj.107.114835.
Kim J, Mao J, Gunner MR: Are acidic and basic groups in buried proteins predicted to be ionized?. J Mol Biol. 2005, 348: 1283-1298. 10.1016/j.jmb.2005.03.051.
Glaser F, Pupko T, Paz I, Bell RE, Bechor-Shental D, Martz E, Ben-Tal N: ConSurf: identification of functional regions in proteins by surface-mapping of phylogenetic information. Bioinformatics. 2003, 19: 163-164. 10.1093/bioinformatics/19.1.163.
Wells JA: Systematic mutational analyses of protein-protein interfaces. Methods Enzymol. 1991, 202: 390-411.
Jin L, Wells JA: Dissecting the energetics of an antibody-antigen interface by alanine shaving and molecular grafting. Protein Sci. 1994, 3: 2351-2357. 10.1002/pro.5560031219.
Kortemme T, Baker D: A simple physical model for binding energy hot spots in protein-protein complexes. Proc Natl Acad Sci U S A. 2002, 99: 14116-14121. 10.1073/pnas.202485799.
Guerois R, Nielsen JE, Serrano L: Predicting changes in the stability of proteins and protein complexes: a study of more than 1000 mutations. J Mol Biol. 2002, 320: 369-387. 10.1016/S0022-2836(02)00442-4.
Darnell SJ, LeGault L, Mitchell JC: KFC Server: interactive forecasting of protein interaction hot spots. Nucleic Acids Res. 2008, 36 (Web Server issue): W265-W269.
Cho KI, Kim D, Lee D: A feature-based approach to modelling protein–protein interaction hot spots. Nucleic Acids Res. 2009, 37: 2672-2687. 10.1093/nar/gkp132.
Tuncbag N, Keskin O, Gursoy A: HotPoint: hot spot prediction server for protein interfaces. Nucleic Acids Res. 2010, 38 (Suppl): W402-W406.
Zhu X, Mitchell JC: KFC2: a knowledge-based hot spot prediction method based on interface solvation, atomic density, and plasticity features. Proteins. 2011, 79: 2671-2683. 10.1002/prot.23094.
Landon MR, Lancia DR, Yu J, Thiel SC, Vajda S: Identification of hot spots within druggable binding regions by computational solvent mapping of proteins. J Med Chem. 2007, 50: 1231-1240. 10.1021/jm061134b.
Guney E, Tuncbag N, Keskin O, Gursoy A: HotSprint: database of computational hot spots in protein interfaces. Nucleic Acids Res. 2008, 36: D662-D666.
Tuncbag N, Gursoy A, Keskin O: Identification of computational hot spots in protein interfaces: combining solvent accessibility and inter-residue potentials improves the accuracy. Bioinformatics. 2009, 25: 1513-1520. 10.1093/bioinformatics/btp240.
Li L, Zhao B, Cui Z, Gan J, Sakharkar MK, Kangueane P: Identification of hot spot residues at protein-protein interface. Bioinformation. 2006, 1: 121-126. 10.6026/97320630001121. 18
Li X, Keskin O, Ma B, Nussinov R, Liang J: Protein-protein interactions: hot spots and structurally conserved residues often locate in complemented pockets that pre-organized in the unbound states: implications for docking. J Mol Biol. 2004, 344: 781-795. 10.1016/j.jmb.2004.09.051.
Hu Z, Ma B, Wolfson H, Nussinov R: Conservation of polar residues as hot spots at protein interfaces. Proteins. 2000, 39: 331-342. 10.1002/(SICI)1097-0134(20000601)39:4<331::AID-PROT60>3.0.CO;2-A.
Ma B, Elkayam T, Wolfson H, Nussinov R: Protein-protein interactions: structurally conserved residues distinguish between binding sites and exposed protein surfaces. Proc Natl Acad Sci U S A. 2003, 100: 5772-5777. 10.1073/pnas.1030237100.
Ma B, Nussinov R: Trp/Met/Phe hot spots in protein-protein interactions: potential targets in drug design. Curr Top Med Chem. 2007, 7: 999-1005. 10.2174/156802607780906717.
Ofran Y, Rost B: Protein-protein interaction hotspots carved into sequences. PLoS Comput Biol. 2007, 3: e119-10.1371/journal.pcbi.0030119.
Hsu CM, Chen CY, Liu BJ, Huang CC, Laio MH, Lin CC, Wu TL: Identification of hot regions in protein-protein interactions by sequential pattern mining. BMC Bioinformatics. 2007, 8 (Suppl. 5): S8-
DeLano WL: Unraveling hot spots in binding interfaces: progress and challenges. Curr Opin Struct Biol. 2002, 12: 14-20. 10.1016/S0959-440X(02)00283-X.
Diller DJ, Humblet C, Zhang X, Westerhoff LM: Computational alanine scanning with linear scaling semiempirical quantum mechanical methods. Proteins. 2010, 78: 2329-2337. 10.1002/prot.22745.
Massova I, Kollman PA: Computational alanine scanning to probe protein-protein interactions: a novel approach to evaluate binding free energies. J Am Chem Soc. 1999, 121: 8133-8139. 10.1021/ja990935j.
Assi SA, Tanaka T, Rabbitts TH, Fernandez-Fuentes N: PCRPi: presaging critical residues in protein interfaces, a new computational tool to chart hot spots in protein interfaces. Nucleic Acids Res. 2010, 38: e86-10.1093/nar/gkp1158.
Darnell SJ, Page D, Mitchell JC: An automated decision-tree approach to predicting protein interaction hot spots. Proteins. 2007, 68: 813-823. 10.1002/prot.21474.
Wang L, Liu ZP, Zhang XS, Chen L: Prediction of hot spots in protein interfaces using a random forest model with hybrid features. Protein Eng Des Sel. 2012, 25 (3): 119-126. 10.1093/protein/gzr066.
Grosdidier S, Fernandez-Recio J: Identification of hot spot residues in protein–protein interactions by computational docking. BMC Bioinformatics. 2008, 9: 447-10.1186/1471-2105-9-447.
Gao Y, Wang R, Lai L: Structure-based method for analyzing protein-protein interfaces. J Mol Model. 2004, 10: 44-54. 10.1007/s00894-003-0168-3.
Gonzalez-Ruiz D, Gohlke H: Targeting protein-protein interactions with small molecules: challenges and perspectives for computational binding epitope detection and ligand finding. Curr Med Chem. 2006, 13: 2607-2625. 10.2174/092986706778201530.
Xia JF, Zhao XM, Song J, Huang DS: APIS: accurate prediction of hot spots in protein interfaces by combining protrusion index with solvent accessibility. BMC Bioinformatics. 2010, 11: 174-10.1186/1471-2105-11-174.
Sukhwal A, Sowdhamini R: Oligomerisation status and evolutionary conservation of interface of protein structural domain superfamilies. Mol BioSyst. 2013, 9: 1652-1661. 10.1039/c3mb25484d.
Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25: 3389-3402. 10.1093/nar/25.17.3389.
Li W, Godzik A: Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2006, 22: 1658-1659. 10.1093/bioinformatics/btl158.
Larkin MA, Blackshields G, Brown NP, Chenna R, McGettigan PA, McWilliam H, Valentin F, Wallace IM, Wilm A, Lopez R, Thompson JD, Gibson TJ, Higgins DG: ClustalW and ClustalX version 2. Bioinformatics. 2007, 23 (21): 2947-2948. 10.1093/bioinformatics/btm404.
Johnson MS, Overington JP, Blundell TL: A structural basis for sequence comparisons: an evaluation of scoring methodologies. J Mol Biol. 1993, 233: 716-738. 10.1006/jmbi.1993.1548.
Pugalenthi G, Suganthan PN, Sowdhamini R, Chakrabarti S: SMotif: a server for structural motifs in proteins. Bioinformatics. 2007, 23 (5): 637-638. 10.1093/bioinformatics/btl679.
Wallis R, Leung KY, Osborne MJ, James R, Moore GR, Kleanthous C: Specificity in protein–protein recognition: conserved Im9 residues are the major determinants of stability in the colicin E9 dnase Im9 complex. Biochemistry. 1998, 37: 476-485. 10.1021/bi971884a.
Curtis MD, James R: Investigation of the specificity of the interaction between colicin E9 and its immunity protein by site-directed mutagenesis. Mol Microbiol. 1991, 5: 2727-2733. 10.1111/j.1365-2958.1991.tb01981.x.
Radisky ES, Lu CJK, Kwan G, Koshland DE: Role of the intermolecular hydrogen bond network in the inhibitory power of chymotrypsin inhibitor 2. Biochemistry. 2005, 44 (18): 6823-6830. 10.1021/bi047301w.
Radisky ES, Kwan G, Lu CJK, Koshland DE: Binding, proteolytic, and crystallographic analyses of mutations at the protease − inhibitor interface of the subtilisin BPN‘/chymotrypsin inhibitor 2 complex. Biochemistry. 2004, 43 (43): 13648-13656. 10.1021/bi048797k.
The work was supported by Centre of Excellence Grant (BT/01/CoE/09/01) of Department of Biotechnology (DBT), India. AS and MM were supported by this DBT grant. PS is supported by Department of Biotechnology scholarship. The authors would like to thank NCBS (TIFR) for infrastructural support.
The authors declare that they have no competing interests.
RS conceived the idea and RS and PS designed the algorithm. PS is involved in the entire coding of the algorithm. MM and PS performed evaluation of the algorithm. MM and AS performed case studies. PS, MM, AS and RS wrote the manuscript. All authors have read and approved the final version of the manuscript.
Electronic supplementary material
Additional file 1: Hotspot and non-hotspot residues from the Alanine scanning database used as the training set.(XLSX 18 KB)
Additional file 5: Availability of homologues at different sequence identity threshold for some PDB entries.(XLSX 11 KB)
Additional file 6: Threshold values for each scoring scheme and corresponding “true positive rate” and “false positive rate”.(XLSX 24 KB)
About this article
Cite this article
Shingate, P., Manoharan, M., Sukhwal, A. et al. ECMIS: computational approach for the identification of hotspots at protein-protein interfaces. BMC Bioinformatics 15, 303 (2014). https://doi.org/10.1186/1471-2105-15-303