High-throughput identification of interacting protein-protein binding sites
© Chung et al. 2007
Received: 21 December 2006
Accepted: 27 June 2007
Published: 27 June 2007
With the advent of increasing sequence and structural data, a number of methods have been proposed to locate putative protein binding sites on protein surfaces. These methods do not, however, indicate which binding sites bind to one another, so methods that can identify whether binding sites interact are needed.
We have developed a new method using a machine learning approach to detect if protein binding sites, once identified, interact with each other. The method exploits information relating to sequence and structural complementarity across protein interfaces and has been tested on a non-redundant data set consisting of 584 homo-dimers and 198 hetero-dimers extracted from the PDB. Results indicate that 87.4% of the interacting binding sites and 68.6% of the non-interacting binding sites were correctly identified. Furthermore, we built a pipeline that links this method to a modified version of our previously developed method that predicts the location of binding sites.
We have demonstrated that this high-throughput pipeline is capable of identifying binding sites for proteins, their interacting binding sites and, ultimately, their binding partners on a large scale.
Protein-protein interactions are essential to most biological processes, for example, signal transduction, hormone-receptor binding and immunological recognition. These processes comprise complex cellular protein interaction networks that are becoming increasingly accessible in the post-genome era of high-throughput proteomics. Experimental methods such as mass spectrometry, phage display and yeast two-hybrid have been developed to quickly identify interactions between proteins in various organisms [1–4]. Concurrently, computational approaches exploiting amino acid properties, genomic and evolutionary information [5–13] have been proposed to determine whether proteins interact or not (binary interactions). While both large-scale experimental and computational methods are known to produce many false positive and false negative predictions, combining several methods may provide more reliable results. The idea of using consensus results is not new and has been used in the meta-servers for structure prediction, which generate consensus models from the results of several prediction servers. Here we provide a new approach to the computational prediction of interacting protein-protein binding sites, which can contribute to this greater accuracy.
Thanks, in large part, to the availability of increasing sequence and structural information, various computational methods have been proposed to identify putative protein-protein binding sites utilizing evolutionary relationships [15–19], properties of surface patches [20, 21], residue hydrophobicity, etc. Recently, machine learning approaches such as neural networks [23–26], support vector machines [27–31] and Bayesian networks have been used to distinguish interface residues from non-interface residues based on sequence and structural properties.
All of these methods locate binding sites on protein surfaces, but none of them provides information about their binding partners (binding specificity). Therefore, methods that identify interacting protein binding sites are necessary. Inherently, these methods would then allow more reliable determination of binary interactions. Docking approaches provide this information by predicting the binary complex of two known structures based on energetic or geometric complementarity [33, 34]. However, long computation times are often required to determine each putative complex, and most docking approaches are limited to rigid protein model analysis. Homology modeling and multimeric threading build an atomic model of a complex based on a template structure using sequence alignments. These two methods have been tested on large-scale data sets [37, 38]. They both rely on the limited number of structure templates of complexes and usually require sequence identity above 30% between homologs [40, 41]. Aytuna et al. predicted protein-protein complexes by seeking pairs of proteins that share structurally and evolutionarily conserved residue similarity to 67 template interfaces. Pazos et al. utilized correlated mutations to determine pairs of proteins that are likely to bind and concurrently identified binding sites. Although structural information is not required for this method, a large set of multiple sequence alignments for each possible pair of proteins is needed.
Ofran et al. have reported significant differences in residue composition and contact preferences between interfaces of hetero-obligomers, hetero-complexes, homo-obligomers, and homo-complexes. In our data set, some differences between homo-dimer and hetero-dimer interfaces have also been observed. There was a higher tendency for hydrophobic-hydrophobic interactions in homo-dimers. On the other hand, more salt bridges and fewer contacts between residues with the same charge were observed in hetero-dimers.
Several previous studies have provided detailed analysis of interaction preferences of different types of protein-protein interfaces in terms of amino acid, secondary structure or other properties [47–51]. The results of those previous studies and our studies show some variations because of the different composition of data sets and the definition of the interface residues. Nevertheless, the survey presented here indicates that information from sequence profile, secondary structure and accessible surface area (ASA) may be useful discriminators for defining contacting interface residues and can be captured by SVM predictors.
We then evaluated the prediction method exploiting structural information. At a surface patch size of 1 (Figure 7(b)), combining secondary structure information with the sequence profile increased the prediction accuracy significantly. However, further incorporation of information on the accessible surface area did not result in any additional improvement. As we increased the patch size to 3 (Figure 7(c)), we noticed that secondary structure and ASA had no impact on prediction. Therefore, for this study, we chose a sequence profile with a patch size of 3 as the default input features.
Table: The prediction performances. Columns: accuracy for interacting binding sites, also known as recall (%); accuracy for non-interacting binding sites (%); average accuracy (%). Rows: Mix1 (sequence window); Mix1 (putative binding sites).
The surface patch used here included only the two nearest surface residues, which are very likely to be located in the same sequence segment. For this reason we further performed a trial using the sequence profile with a window size of 5 in sequence (that is, including the 4 sequentially nearest residues). As expected, the ROC curves in Figure 7(a) indicate that the predictor using sequence information only was able to perform similarly to the predictor using a patch size of 3. At a threshold of 56%, 88.7% of interacting binding sites and 63.3% of non-interacting binding sites were correctly assigned. When the average accuracy reached its maximum (threshold: 64%), 78.5% of interacting binding sites and 76.9% of non-interacting binding sites were correctly assigned.
In this study, two binding sites were predicted to interact with each other if >56% of the total possible residue pairs between them were predicted to be in contact with each other. Raising the threshold increases the precision but decreases the recall (fewer interacting binding sites and more non-interacting binding sites are correctly assigned) and vice versa. In the data set, the contacting residue pairs for two interacting binding sites constituted, on average, only approximately 8% of the total possible residue pairs. Therefore, the threshold (56%) selected above seems to be very high. However, when we considered the 3 spatially nearest residues of each of any two contacting residues to be in contact with each other across the interface, the fraction of contacting residue pairs out of the total possible residue pairs increased to 47%.
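The thresholding rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `sites_interact` and the nested-list input format are hypothetical, and the comparison uses "at or above" the threshold, following the formula given in the Methods (N predicted ≥ P% × N total).

```python
def sites_interact(pair_predictions, threshold_pct=56.0):
    """Decide whether two binding sites are predicted to interact.

    pair_predictions[i][j] is True when residue i of site A and residue j
    of site B are predicted to be in contact (hypothetical input format).
    """
    # Total number of possible residue pairs between the two sites.
    n_total = sum(len(row) for row in pair_predictions)
    if n_total == 0:
        raise ValueError("no residue pairs to evaluate")
    # Number of residue pairs predicted to be in contact.
    n_predicted = sum(1 for row in pair_predictions for hit in row if hit)
    # N_predicted >= P% x N_total, written without a division.
    return n_predicted * 100 >= threshold_pct * n_total
```

With the default threshold, a pair of sites in which 3 of 4 possible residue pairs are predicted to be in contact would be called interacting, whereas 1 of 4 would not.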
A pipeline was built to test if two putative binding sites would interact with each other (Figure 1). Given two proteins A and B, the pipeline first identifies the putative binding site of each protein (with predictor I) and then identifies the interaction between the two putative binding sites (with predictor II, which is the method presented in this study). This pipeline is able to provide information on both the location of binding sites and their binding partners.
Putative binding sites of individual components of each complex in our data set were determined by a method modified from our previous work, using sequence and structural information (predictor I, see Methods). With this method, the recall was 65.6% and the precision was 45.2% at the residue level. The results are summarized as follows: 49.81% of the binding sites were precisely predicted, 71.30% of the binding sites were correctly predicted and 23.6% of the binding sites were partially covered by the predicted residues. If at least 70% of the residues at a site were identified, we defined this to be precisely predicted. If at least 50% of the residues at a site were identified, we considered this to be correctly predicted.
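The two site-level definitions above (at least 70% of residues identified for "precisely predicted", at least 50% for "correctly predicted") can be illustrated with a short Python sketch. The function name `site_coverage` and the use of residue identifiers in sets are assumptions made for the example. Note that the categories are nested: a precisely predicted site is also a correctly predicted one.

```python
def site_coverage(true_site, predicted_residues):
    """Summarize how well predicted residues cover a known binding site."""
    true_site = set(true_site)
    if not true_site:
        raise ValueError("binding site must contain at least one residue")
    # Fraction of the true binding-site residues that were identified.
    coverage = len(true_site & set(predicted_residues)) / len(true_site)
    return {
        "coverage": coverage,
        "precise": coverage >= 0.70,  # at least 70% of residues identified
        "correct": coverage >= 0.50,  # at least 50% of residues identified
    }
```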
The pipeline consists of binding site identification (predictor I) and subsequent prediction of whether two of these identified sites interact (predictor II). Since predictor II depends little on spatially neighboring residues and is mostly sequence dependent, a sequence-only predictor I could be substituted [19, 42].
We have developed a new method to predict interacting protein binding sites using a machine learning technique. To the best of our knowledge, this is the first method to predict interacting binding sites without the constraint of using structure templates. An SVM was trained to learn the complementary information across interfaces and has been tested on a data set consisting of 584 non-redundant homo-dimer and 198 hetero-dimer interfaces. Our predictor successfully identified 87.4% of the interacting binding sites and 68.6% of the non-interacting binding sites. Separate training and testing on homo-dimers and hetero-dimers showed different prediction results, which might be caused by the differences in residue contact preferences between these two types of interfaces. For homo-dimers, 96.4% of the interacting binding sites and 67.5% of the non-interacting binding sites were correctly identified. For hetero-dimers, 66.3% of the interacting binding sites and 62.8% of the non-interacting binding sites were correctly identified. Better predictions are expected as more structures are determined and the number of homo-dimer and hetero-dimer complexes upon which to train increases.
We built a pipeline combining the method discussed here with a modified version of our previously developed method that identifies the location of binding sites. Taking both predictors together, we showed that the prediction accuracy based on putative binding sites decreased only slightly relative to that obtained with accurately known sites. Thus the pipeline enables the simultaneous prediction of binding sites and binding partners, identifying 87.3% of the interacting binding sites and 67.6% of the non-interacting binding sites in our data set. In the future, the pipeline can be used to search for new protein binding sites and interactions in various biological systems, and therefore to build interaction networks based on interaction details between proteins. It can also be used to validate existing networks.
At this time it is difficult to compare the results presented here with those of other methods, since each uses different training sets and there is a lack of a common test set. In addition, most existing methods have not been tested on negative data and hence prediction statistics (precision, recall, etc.) were not provided. Nevertheless, different methods have different limitations and exploit different information to various extents. For example, most docking procedures are computationally expensive, homology modeling and multimeric threading rely on the availability of complex structure templates, and correlated mutation methods need a large set of sequence alignments for each possible protein pair. Current efforts are directed at attaining higher prediction accuracy through incorporation of additional information such as local interface geometry or water-mediated interactions into our predictor.
A non-redundant data set of dimer complexes was compiled using the method of Zhou et al., modified as follows. All non-NMR multiple-chain protein entries with resolution better than 3.5 Å were collected from the PDB (March, 2004). For each entry, two chains were selected as an interacting protein pair if both had more than 20 residues that formed interfacial contacts with each other. A residue was considered to form an interfacial contact if the distance between any of its heavy atoms and any heavy atom of its interacting protein was < 5 Å. The pairs containing chains with < 80 amino acids or SCOP class >= 8 were then filtered out.
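The interfacial-contact criterion above (any heavy-atom pair closer than 5 Å, and more than 20 contacting residues on both chains) might be sketched as follows. The data layout (a dictionary from residue identifier to a list of heavy-atom coordinates) and the function names are assumptions for illustration; parsing of PDB files is left out, and the brute-force all-against-all distance check is chosen for clarity rather than speed.

```python
import math

CONTACT_CUTOFF = 5.0  # Angstrom, heavy-atom distance cutoff from the text


def interface_residues(chain_a, chain_b, cutoff=CONTACT_CUTOFF):
    """Return the residues of each chain that form interfacial contacts.

    chain_a, chain_b: dict mapping residue id -> list of (x, y, z)
    heavy-atom coordinates (hypothetical input format).
    """
    iface_a, iface_b = set(), set()
    for res_a, atoms_a in chain_a.items():
        for res_b, atoms_b in chain_b.items():
            # A contact exists if any heavy-atom pair is closer than the cutoff.
            if any(math.dist(p, q) < cutoff for p in atoms_a for q in atoms_b):
                iface_a.add(res_a)
                iface_b.add(res_b)
    return iface_a, iface_b


def is_interacting_pair(iface_a, iface_b, min_residues=20):
    """Both chains need more than `min_residues` contacting residues."""
    return len(iface_a) > min_residues and len(iface_b) > min_residues
```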
Each of the collected chains was further compared against all other chains by BLAST. Chains were assigned to the same cluster if the sequence identity was > 30% and > 90% of the amino acids were aligned. All interacting protein pairs were mapped to these clusters and the representative pairs were selected. In order to consider dimers only, the representative pairs with chains interacting with more than one chain were discarded. Homo-dimers with both chains having > 30 interface residues and hetero-dimers with both chains having > 20 interface residues were collected, in order to roughly exclude those from crystallographic complexes. This resulted in a non-redundant data set of 584 homo-dimers and 196 hetero-dimers. The data are available upon request from the authors.
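One way to picture the clustering step is shown below. The text gives the criteria (sequence identity > 30% and > 90% of residues aligned) but not the linkage scheme, so this sketch assumes single-linkage clustering implemented with a union-find structure; the function name `cluster_chains` and the tuple format of the pairwise hits are likewise hypothetical.

```python
def cluster_chains(chains, hits, min_identity=30.0, min_coverage=90.0):
    """Group chains by pairwise similarity (assumed single linkage).

    hits: iterable of (chain_1, chain_2, percent_identity, percent_aligned)
    taken from pairwise comparisons (hypothetical format).
    """
    parent = {c: c for c in chains}

    def find(c):
        # Follow parent links to the cluster root, compressing the path.
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    for a, b, identity, coverage in hits:
        if identity > min_identity and coverage > min_coverage:
            parent[find(a)] = find(b)  # merge the two clusters

    clusters = {}
    for c in chains:
        clusters.setdefault(find(c), set()).add(c)
    # Sort clusters by their members for a reproducible order.
    return sorted(clusters.values(), key=lambda members: sorted(members))
```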
We have surveyed the preferences of contacts between different groups of interface residues. We classified the residues into 20 groups in terms of amino acids, 3 groups in terms of secondary structures (alpha helix, beta strand, and others, including coil) and 2 groups in terms of the extent of water exposure (fully exposed: ASA >= 40% of a residue's nominal maximum area; partially exposed: 15% <= ASA < 40%). The secondary structure and ASA of residues for each protein chain were calculated using the DSSP program with the coordinates of a single chain obtained from the corresponding complex structure.
The contact preference (L) for interface residues from group a and group b was calculated as follows:
$L(a, b) = F_{\mathrm{observed}}(a, b) / F_{\mathrm{expected}}(a, b)$
where the observed contact frequency was defined as:
$F_{\mathrm{observed}}(a, b) = N_{\mathrm{observed}}(a, b) / N_{\mathrm{total}}$
and $N_{\mathrm{observed}}(a, b)$ was the number of contacting residue pairs between residue group a and group b. $N_{\mathrm{total}}$ was the total number of all contacting residue pairs. The expected contact frequency was defined as:
$F_{\mathrm{expected}}(a, b) = F(a) \times F(b)$
where F(a) and F(b) were the frequencies of residue group a and group b at interfaces, respectively. In this study, a residue was defined as a surface residue if its ASA was at least 15% of its nominal maximum area. A surface residue was defined to be an interface residue (residue at a binding site) if it formed an interfacial contact. The definition of interfacial contact was described in the data set section.
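As a worked illustration of the contact preference defined above, the following Python sketch computes L(a, b) from a list of observed contacts. It rests on assumptions not stated in the text: contacts are treated as ordered pairs (first group from one chain, second group from the other), every group appearing in a contact also appears in the list of interface residues, and the function name `contact_preference` is hypothetical.

```python
from collections import Counter


def contact_preference(contact_pairs, interface_groups):
    """Compute L(a, b) = F_observed(a, b) / F_expected(a, b).

    contact_pairs: list of (group_a, group_b), one per contacting residue pair.
    interface_groups: list with the group label of every interface residue.
    """
    n_total = len(contact_pairs)             # N_total
    pair_counts = Counter(contact_pairs)     # N_observed(a, b)
    group_counts = Counter(interface_groups)
    n_residues = len(interface_groups)

    preferences = {}
    for (a, b), n_observed in pair_counts.items():
        f_observed = n_observed / n_total
        # F_expected(a, b) = F(a) x F(b), from interface group frequencies.
        f_expected = (group_counts[a] / n_residues) * (group_counts[b] / n_residues)
        preferences[(a, b)] = f_observed / f_expected
    return preferences
```

A value above 1 indicates that contacts between the two groups occur more often than expected from their interface frequencies alone, and a value below 1 indicates that they occur less often.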
The SVMs were trained to predict if two binding sites interact with each other. The SVM software used in this study was SVMlight. The radial basis function $\exp(-\gamma \lVert b - a \rVert^{2})$ was chosen as the kernel, with γ = 0.01 and regularization parameter C = 10.
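For readers unfamiliar with the kernel, the radial basis function with the stated γ = 0.01 can be evaluated directly. The short Python sketch below is illustrative only (the function name `rbf_kernel` is an assumption) and is independent of the SVM software used in the study.

```python
import math


def rbf_kernel(a, b, gamma=0.01):
    """Radial basis function kernel exp(-gamma * ||b - a||^2)."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    # Squared Euclidean distance between the two feature vectors.
    squared_distance = sum((y - x) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * squared_distance)
```

Identical vectors give a kernel value of 1, and the value decays toward 0 as the vectors move apart; a small γ such as 0.01 makes that decay slow.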
During the training process (Figure 5), two interface residues, one from each of two interacting binding sites, were considered to form a contacting residue pair (positive class) if the distance between any of their respective heavy atoms was less than 5 Å. A non-contacting residue pair (negative class) was defined as any possible interface residue pair between two non-interacting protein binding sites (binding sites from two non-interacting protein chains). Non-interacting protein chains were generated from our data set by pairing proteins that were not reported to be in the same cellular location as defined by the UniProt database. Since the number of non-interacting protein pairs greatly outnumbered the number of interacting protein pairs, we randomly selected a small portion of the large pool as representative data, making the number of non-interacting protein pairs equal to the number of interacting protein pairs. For example, for the combined data set of homo-dimers and hetero-dimers, there were 780 interacting protein pairs. The number of all possible non-interacting protein pairs was 460070. From this pool, 780 non-interacting protein pairs were randomly chosen.
To reduce the data redundancy and training time, for each binding site residue with multiple contacting residues at the other site, only the pairing with the smallest distance was selected to be included in the positive training set. Since there were many more non-interacting residue pairs than interacting residue pairs, a set of non-interacting residue pairs was randomly selected so that the ratio of positive to negative data was 1:1.
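The two data-reduction steps above can be sketched as follows. Several details are assumptions: the nearest-partner filter is applied from the perspective of one site only (the text does not say whether it is applied to both sites), the input tuples and function names are hypothetical, and a fixed random seed is used purely to make the example reproducible.

```python
import random


def select_positive_pairs(contacts):
    """Keep, for each residue of site A, only its closest contacting partner.

    contacts: iterable of (residue_a, residue_b, distance_in_angstrom).
    """
    closest = {}
    for res_a, res_b, distance in contacts:
        if res_a not in closest or distance < closest[res_a][2]:
            closest[res_a] = (res_a, res_b, distance)
    return [(a, b) for a, b, _ in closest.values()]


def sample_negative_pairs(negative_pairs, n_positive, seed=0):
    """Randomly draw as many negatives as positives (1:1 ratio)."""
    rng = random.Random(seed)
    return rng.sample(list(negative_pairs), n_positive)
```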
The SVM was fed two surface patches, each including one residue of an interface residue pair and its n spatially nearest surface residues (n was an adjustable parameter). The input features were different combinations of sequence profile, secondary structure and accessible surface area of the residues in these 2 surface patches. If all three input features were used and the surface patch size was set to 3, each residue pair was encoded as a feature vector with a dimension of 2 × 3 × 24: 2 × (the surface residue to be predicted + 2 nearest neighbors) × (20 amino acids + accessible surface area + 3 types of secondary structure). The sequence profiles were obtained from 3 iterations of a PSI-BLAST search against the NCBI non-redundant database (NR) with e = 0.001 and h = 0.001. The 3 categories of secondary structure were: alpha helix, beta sheet, and others, including coil regions (encoded 1 if it was in this category and -1 if it was not). All input values were scaled between -1 and 1.
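The 2 × 3 × 24 = 144-dimensional encoding described above might be assembled as in the sketch below. The ordering of features within the vector, the category labels, and the function names are assumptions made for illustration; profile and ASA values are assumed to have already been scaled to the range -1 to 1.

```python
SS_CLASSES = ("helix", "sheet", "other")  # assumed labels for the 3 categories


def encode_residue(profile, asa, secondary_structure):
    """Encode one residue as 20 profile values + ASA + 3 structure flags."""
    if len(profile) != 20:
        raise ValueError("sequence profile must have 20 values")
    if secondary_structure not in SS_CLASSES:
        raise ValueError("unknown secondary structure category")
    # 1 for the residue's category, -1 for the other two.
    ss_flags = [1.0 if secondary_structure == c else -1.0 for c in SS_CLASSES]
    return list(profile) + [asa] + ss_flags


def encode_pair(patch_a, patch_b):
    """Concatenate the encodings of all residues in the two surface patches.

    Each patch is a list of (profile, asa, secondary_structure) tuples: the
    central residue followed by its nearest surface neighbors.
    """
    vector = []
    for patch in (patch_a, patch_b):
        for profile, asa, secondary_structure in patch:
            vector.extend(encode_residue(profile, asa, secondary_structure))
    return vector
```

With a patch size of 3 on each side, `encode_pair` returns 2 × 3 × 24 = 144 values, matching the dimension given in the text.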
During the testing process (Figure 6), two binding sites A and B were predicted to interact with each other if the number of positively predicted residue pairs between them reached a certain threshold:
$N_{\mathrm{predicted}} \geq P\% \times N_{\mathrm{total}}$
where $N_{\mathrm{predicted}}$ is the number of positively predicted residue pairs and $N_{\mathrm{total}}$ is the total number of possible residue pairs between binding sites A and B. Given the threshold, the prediction performance was measured as follows:
Precision = TP/(TP + FP)
Accuracy for positive class, or recall, or true positive rate = TP/(TP + FN)
Accuracy for negative class = TN/(TN + FP)
False positive rate = FP/(FP + TN)
Average accuracy = (TP + TN)/(TP + TN + FP + FN)
where TP is the number of correctly predicted interacting binding sites, TN is the number of correctly predicted non-interacting binding sites, FP is the number of non-interacting binding sites incorrectly predicted to be interacting and FN is the number of interacting binding sites incorrectly predicted to be non-interacting.
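The performance measures listed above translate directly into code. The Python sketch below simply restates the five formulas; the function name `binding_site_metrics` and the dictionary keys are chosen for illustration.

```python
def binding_site_metrics(tp, tn, fp, fn):
    """Compute the performance measures from the four outcome counts."""
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),                # accuracy for positive class
        "negative_accuracy": tn / (tn + fp),     # accuracy for negative class
        "false_positive_rate": fp / (fp + tn),
        "average_accuracy": (tp + tn) / (tp + tn + fp + fn),
    }
```

When the numbers of interacting and non-interacting binding site pairs are equal, as in the balanced data sets used here, the average accuracy defined this way equals the mean of the accuracies for the positive and negative classes.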
The putative protein-protein binding sites were determined by a method modified from our previous work. An SVM was trained to locate binding site residues on a protein surface by using the sequence profile and accessible surface area of spatially neighboring surface residues. The predictor was trained and tested on 976 non-redundant chains (584 chains from one of the components of homo-dimers, and 196 × 2 chains from both components of hetero-dimers) with 2-fold cross-validation. The other component of each homo-dimer was tested using the predictor whose training set did not contain its homolog. The residues ranked as the top 30% by SVM were further clustered using the clustering method described in .
This work was supported by NIH grants GM63208 and GM08326.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.