Sequence-based identification of interface residues by an integrative profile combining hydrophobic and evolutionary information
© Chen and Li; licensee BioMed Central Ltd. 2010
Received: 25 March 2010
Accepted: 28 July 2010
Published: 28 July 2010
Protein-protein interactions play essential roles in protein function determination and drug design. Numerous methods have been proposed to recognize their interaction sites; however, only a small proportion of protein complexes have been successfully resolved because of the high cost. Therefore, it is important to improve the performance of predicting protein interaction sites from primary sequence alone.
We propose a new idea to construct an integrative profile for each residue in a protein by combining its hydrophobic and evolutionary information. A support vector machine (SVM) ensemble is then developed, where SVMs train on different pairs of positive (interface sites) and negative (non-interface sites) subsets. The subsets, which have roughly the same sizes, are grouped in the order of accessible surface area change before and after complexation. A self-organizing map (SOM) technique is applied to group similar input vectors so that interface residues are identified more accurately. An ensemble of ten-SVMs improves MCC by around 8% and F1 by around 9% over an ensemble of three-SVMs. As expected, SVM ensembles consistently perform better than individual SVMs. In addition, the model based on the integrative profiles outperforms those based on the sequence profile or the hydropathy scale alone. As our method uses a small number of features to encode the input vectors, our model is simpler, faster and more accurate than the existing methods.
The integrative profile, which combines hydrophobic and evolutionary information, contributes most to the protein-protein interaction prediction. Results show that the evolutionary context of a residue with respect to hydrophobicity improves the identification of protein interface residues. In addition, the ensemble of SVM classifiers improves the prediction performance.
Datasets and software are available at http://mail.ustc.edu.cn/~bigeagle/BMCBioinfo2010/index.htm.
In living cells, proteins interact with other proteins in order to perform specific biological functions, such as signal transduction or immunological recognition, DNA replication and gene translation, as well as protein synthesis. These interactions are localized to the so-called "interaction sites" or "interface residues".
Identification of these residues will allow us to understand how proteins recognize other molecules and to gain clues about their possible functions at the level of the cell and of the organism. It can also improve our understanding of disease mechanisms and further advance pharmaceutical design [2, 3]. 3D (three-dimensional) structures of proteins are the basis for the identification. However, resolving 3D protein structures by experimental methods, such as X-ray crystallography and nuclear magnetic resonance, is much more time-consuming than sequencing proteins. This is why fewer than 62300 protein structures are available in the PDB databank while more than ten million proteins are sequenced in the UniProtKB/TrEMBL database, as of Jan. 2010. To narrow this huge gap, various computational methods have been developed to predict protein structures, assisted by the abundance of protein information deposited in various biological databases. Among them, methods to identify protein-protein interface residues have attracted research attention for a long time.
The pioneering work by Kini and Evans addressed the issue of protein interaction site prediction by a unique predictive method based on the observation that "proline" is the most common residue found in the flanking segments of interaction sites. Jones and Thornton aimed to analyze and predict surface patches that overlap with interfaces by computing a combined score that gives the probability of a surface patch forming protein-protein interactions. Other works have addressed various aspects of protein structure and behavior, such as patch analysis, solvent-accessible surface area buried upon association, free energy changes upon alanine-scanning mutations, in silico two hybrid systems, sequence or structure conservation information [13–17], and sequence hydrophobicity distribution.
Among them, many machine learning methods have been developed or adopted, such as those using support vector machines (SVM) [16, 17, 19–22], neural networks [13–15, 23, 24], genetic algorithms [25, 26], hidden Markov models, Bayesian networks [28, 29], random forests [30, 31], and so on.
Numerous properties were used in previous work to identify protein-protein interactions. They can be roughly divided into two categories: sequence-based properties and structure-based properties. Sequence-based properties include residue composition and propensity [7, 22], hydrophobic scale, predicted structural features such as predicted secondary structures, features from multiple sequence alignments [17, 33], and so on. On the other hand, structure-based properties were also widely utilized, such as the size of interfaces [7, 35], shape of interfaces [36–38], clustering of interface atoms [39, 40], B-factor, electrostatic potential [19, 21], spatial distribution of interface residues [39, 40], and others. The existing methods using these properties showed good performance in the prediction of protein-protein interactions. However, those properties that are specifically significant for particular protein complexes have not been fully assessed. Furthermore, a large set of properties does not always perform well.
Since the number of protein structures is significantly smaller than the number of protein sequences determined by large-scale DNA sequencing methods, it is important to identify protein-protein interaction sites from amino acid sequences alone. It is also valuable to use sequence-based features without experimental 3D structure information. In fact, predicted structural features such as secondary structure can still be helpful in the identification of interaction sites. However, sequence-based approaches to identifying protein interaction sites are still more difficult than those based on structure information. The reasons are that: (1) the relationship between sequence-based features and protein-protein interactions is not fully understood; (2) it is difficult to represent each residue in a protein by a series of sequence-based features; (3) the imbalance between interaction samples and non-interaction samples may worsen the interface identification.
This work addresses these issues by using integrative features and by adopting an SVM ensemble method based on balanced training datasets. Since identification of interaction sites in hetero-complexes is much more difficult and more interesting than that in homo-complexes, in this work we focus on hetero-complexes. We first design a schema to represent each residue that integrates hydrophobic and evolutionary information of the residue in a complex. Then an ensemble of SVMs is developed, where SVMs train on different pairs of positive (interface samples) and negative (non-interface samples) subsets. The subsets, which have roughly the same sizes, are grouped in the order of accessible surface area change (ΔASA) before and after complexation. A self-organizing map (SOM) technique is applied to group similar training samples, with the aim of identifying interface residues more accurately. An ensemble of ten-SVMs improves MCC by around 8% and F1 by around 9%, compared to an ensemble of three-SVMs. We also found that the SVM ensemble always performs better than individual SVMs. Moreover, using the SOM technique achieves an increase of MCC by 1.3 and an increase of F1 by 2%.
We calculated the amino acid composition in our dataset to show the propensity of the 20 amino acid types for interface and non-interface regions. The propensities for the 20 amino acid types on a logarithmic (log2) scale are shown in Additional file 1. Results show that hydrophobic amino acids with smaller propensity values, such as 'A', 'G', and 'V', are mostly found in non-interface regions. Conversely, hydrophilic amino acids 'R', 'Y', 'W', and 'H' are often present in interface regions. Some of these findings are consistent with other literature [18, 43]. Interestingly, Arginine is the most frequently occurring residue in interface regions, while Cysteine and Alanine appear mostly in non-interface regions.
Determination of the sliding window length
A sliding window technique is used to represent each target residue in this study, where the most challenging issue is to represent each residue by a feature vector and then to construct a predictor. Our first step is to determine a good sliding window length, since prediction performance usually varies with the window length L. The tradeoff between prediction performance and algorithm complexity is also a concern. In this work, three individual SVMs were selected from the ten-SVMs without SOM, giving 120 possible combinations. The average performance of those SVMs was used to determine the window length. Five window lengths, 5, 11, 15, 19, and 27, were tried. Results show that a sliding window of 19 residues is sufficient to train and test our model: although the model with window length 27 performed slightly better than that with window length 19, the latter ran faster. The comparison of sensitivity-precision under different window lengths is illustrated in Additional file 2. Note that a window length of 5 leads to the worst performance. Unless otherwise stated, we adopt the window length 19 to evaluate our model and identify protein-protein interface residues.
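The window encoding and the count of three-SVM combinations described above can be illustrated with a short sketch. It assumes one profile value per residue and pads the chain termini with zeros; the padding rule and the function name `sliding_windows` are assumptions made for illustration and are not taken from this work.

```python
from math import comb


def sliding_windows(values, length=19, pad=0.0):
    """Return one fixed-length window per residue, centred on that residue.

    values: per-residue profile values along one protein chain.
    pad:    filler used beyond the chain termini (an assumption).
    """
    if length % 2 == 0:
        raise ValueError("window length must be odd so a centre residue exists")
    half = length // 2
    padded = [pad] * half + list(values) + [pad] * half
    return [padded[i:i + length] for i in range(len(values))]


# Choosing 3 SVMs out of 10 gives the 120 combinations mentioned in the text.
n_combinations = comb(10, 3)
```

With a window length of 19, each residue is therefore described by 19 values: its own and those of the nine residues on either side.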
Prediction performance without SOM
Additional file 3 shows the performance comparison among the combined SVMs discussed above with three thresholds. Because no single measure can fully evaluate prediction performance, we report all six measures for our predictor. In this work, MCC and F1 are used as the main measures to evaluate our method. In practice, using MCC as the benchmark measure may cover fewer positive samples, while using F1, which balances sensitivity and precision, may correctly identify fewer positive samples. From this figure, the SVM with threshold 3 performs better than those with thresholds 1 and 2, and achieves a sensitivity of 31.39%, precision of 81.12%, specificity of 96.74%, accuracy of 76.6%, and F1 of 45.27% when reaching the largest MCC of 0.4009. When F1 is used as the benchmark measure, our model with threshold 3 achieves a sensitivity of 78.44%, precision of 46.79%, specificity of 60.26%, accuracy of 65.86%, and MCC of 0.3576 when reaching the largest F1 of 58.62%.
Performance comparison by samples selection by ΔASA
Results of predictions by random sample selection
Another issue we would like to address is why the models with random sample selection perform better than those with ΔASA-sorted sample selection before the classifiers are combined and, surprisingly, why they perform similarly after the classifiers are combined. The reason is probably that the models have been trained efficiently on training data whose ΔASA distribution is comparable to that of the test data. Furthermore, our results suggest that if the ΔASA distribution of the training data is consistent with that of the test data, a good prediction can be obtained.
Prediction results of combined SVMs
Prediction performance with the use of SOM
Because the number of residues in proteins is limited, adopting more neurons in the SOM is not always a good way to cluster similar input vectors for residues. Therefore, three SOM sizes, 3 × 3, 5 × 5, and 7 × 7, were investigated in this work. With the two modifications applied, the relatively less important neurons associated with a small number of samples and the neurons with relatively larger entropies were removed.
Table: Evaluation with and without the use of SOM on the ensemble of ten-SVMs (SOM sizes compared: 3 × 3*, 3 × 3, 5 × 5, 7 × 7).
Improvement by using evolutionary context of residues with respect to hydrophobicity
Kauzmann first pointed out that the hydrophobic effect is the most significant property for protein folding and stability. As for interface prediction, it is often a major contributor to the stability of protein complexes. Gallet et al. proposed a fast method to predict protein interaction sites by analyzing hydrophobicity distribution. This work suggested that interface residues can be identified by using the mean hydrophobicity and the mean hydrophobic moment. However, it appears that the hydrophobic effect alone is insufficient for protein interface prediction or does not appear to be useful for it.
In this work, we used two feature profiles, the sequence profile and the hydropathy scale. The former was extracted from the HSSP database, where each amino acid is represented by elements whose values are based on multiple alignments of protein sequences and their potential structural homologs. The latter was adopted from Kyte-Doolittle's measurement. Although the two profiles have been used before in interface prediction, the novel integrative technique here can capture the residue's evolutionary context with respect to hydrophobicity in protein-protein interaction sites. It can thus help to improve the interface prediction.
Prediction results of ensembles of ten-SVMs with three profiles
A biological case of improvement by classifier ensemble
Classifier ensembles can perform well in many classification tasks. Combining the outputs of a number of independent classifiers can improve the classification rate since the errors made by one classifier may be corrected by the others [48–50]. Hansen and Salamon showed that better performance can be achieved by using optimized parameters and by training different classifiers on different portions of the dataset. In this work, we applied the classifier ensemble technique to combine the outputs from the ten independent SVM classifiers, whose training datasets do not overlap and are thus independent of each other.
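The combination of classifier outputs can be sketched as a vote count compared against a threshold. This is a minimal illustration, assuming that the thresholds reported in this work (for example, threshold 3 for three-SVMs) denote the minimum number of SVMs that must label a residue as an interface residue; that reading and the function name `ensemble_predict` are assumptions.

```python
def ensemble_predict(votes, threshold):
    """Combine binary outputs (1 = interface, 0 = non-interface) by vote count.

    votes:     one 0/1 prediction per individual classifier.
    threshold: minimum number of positive votes needed to call the residue
               an interface residue (assumed meaning of "threshold").
    """
    return 1 if sum(votes) >= threshold else 0


# Ten hypothetical SVM outputs for a single residue.
example_votes = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]
lenient = ensemble_predict(example_votes, threshold=3)   # 8 votes >= 3
strict = ensemble_predict(example_votes, threshold=9)    # 8 votes < 9
```

Raising the threshold trades sensitivity for precision, which is consistent with the high-precision, low-sensitivity operating points reported at the largest MCC.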
Comparison with other methods
Using true secondary structure information and other real 3D structure information in Sikic's method may lead to overestimated interface predictions, although it obtained slightly higher precision than our model at sensitivities from 50% to 90%. Therefore, we show the performance curve of Sikic's method based on both sequence and real 3D structure only for reference, not for direct comparison. However, as discussed in Figure 2, our model with threshold 9 performs similarly to Sikic's structure-based model at sensitivities from 50% to 90%. It should be noted that our model and Sikic's method share the same definition of interface residues and therefore obtain approximately the same ratio of interface residues to total residues, 27.56% in our dataset and 27.5% in Sikic's method. As a result, our method outperforms Sikic's method based on sequence information. Furthermore, our method based on sequence alone performs similarly to Sikic's method based on both sequence and 3D structure.
Table: Performance of methods on hetero-complexes with sequence alone (methods compared: Wang and Chen; Res et al.; Koike and Takagi*; Sikic et al.; Chen and Jeong; ISIS et al.; Ofran and Rost).
To show the potential of our model for practical problems, a CCD-IBD complex (PDB:2bgn) was taken as a test case. Again, the evaluation of this blind test is based solely on sequence information, without knowledge of the 3D structure of the complex or of the true interacting residues.
This paper addresses the problem of identifying interface residues in hetero-complexes by using an integrative profile. This novel profile combines the residue sequence profile with the hydropathy scale and thereby obtains a standard deviation value for each residue in proteins. The deviation value may reveal the relationship between the evolution of a residue in proteins and its hydrophobicity in water surroundings. The novel residue profile and an ensemble of SVMs together achieve a good prediction of protein-protein interactions, with a sensitivity of 39.76%, precision of 83.07%, specificity of 96.91%, accuracy of 81.14%, and F1 of 53.78% when achieving the largest MCC of 0.4842. In addition, the SOM technique is adopted to investigate the interacting relationship of residues. When the SOM technique is used, the prediction performance increases to a sensitivity of 42.84%, precision of 81.96%, specificity of 96.35%, accuracy of 81.39%, and F1 of 56.25% when achieving the largest MCC of 0.4979.
Moreover, a residue in our work was represented as a 1-by-19 vector by using the sliding window with length 19. This is much smaller than in most other methods: the input vector representing a residue in Sikic et al.'s method contained 9 × 20 = 180 elements, and 1050 features were used as the input vector in Chen and Jeong's method. Therefore our model is very fast and simple. More importantly, a larger number of features in input vectors does not necessarily lead to better performance. As pointed out by previous work, a machine learning algorithm adopting a simple representation of a sequence space could be much more powerful and useful than one using the original data containing all details. The biological properties that may be responsible for protein-protein interactions are not fully understood. Therefore, how to apply feasible features or feature transformations in protein interaction prediction remains an open problem. Additionally, the imbalance between interface residues and non-interface residues is a very challenging issue, which often causes classifier over-fitting. The ensemble of classifiers may be a feasible way to balance training data.
Finally, a residue's evolutionary context with respect to hydrophobicity plays an important role in interface prediction. The above discussion suggests that integrating a residue's evolutionary context with other properties of residues, such as residue volume or free energy of solution in water, is a plausible way to discover protein-protein interactions. In our future work, we will investigate the inner relationships of interacting residues and make use of them for more accurate prediction.
The complexes used in this work were extracted from the 3dComplex database, which is a database for automatically generating non-redundant sets of complexes. Only those proteins in hetero-complexes with sequence identity ≤ 30% were selected in this work. Proteins and molecules with fewer than 30 residues were excluded from our dataset. Protein chains that are not available in the HSSP database were also removed. As a result, our dataset contains 2499 protein chains in 737 complexes. There are mainly two definitions of protein interface residues. The first is based on differences in the ASA of the residues before and after complexation, and the second is based on the distance between interacting residues. In this article, the ASA change is used to extract interface residues, and the PSAIA software was used for the extraction. In our case, a residue is considered to be an interface residue if the difference between its ASA in the unbound and bound forms is > 1 Å². As a result, we obtained 142410 interface residues (positive samples) and 374346 non-interface residues (negative samples), where the ratio of the number of positive samples to that of all samples is 27.56%.
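The interface definition and the reported class ratio can be checked with a few lines. This is a minimal sketch: the function name `is_interface` and its argument names are hypothetical, and the ASA values would in practice come from a tool such as PSAIA.

```python
def is_interface(asa_unbound, asa_bound, cutoff=1.0):
    """Label a residue as interface if its ASA drops by more than `cutoff` (in square angstroms)
    on going from the unbound to the bound form."""
    return (asa_unbound - asa_bound) > cutoff


# Sample counts reported in the text.
positives = 142410   # interface residues
negatives = 374346   # non-interface residues
positive_fraction = positives / (positives + negatives)
positive_percent = round(100 * positive_fraction, 2)
```

The computed percentage agrees with the 27.56% stated above.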
In this work we applied a 5-fold cross-validation test to evaluate our proposed method. The proteins in the dataset are divided into 5 subsets consisting of roughly the same number of proteins; in each round, one subset is used for testing and the other four are used for training.
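A protein-level 5-fold split of this kind can be sketched as follows. The random shuffling, the seed, and the helper names `protein_folds` and `train_test_splits` are assumptions for illustration; the text only states that the folds contain roughly the same number of proteins.

```python
import random


def protein_folds(protein_ids, k=5, seed=0):
    """Partition protein identifiers into k folds of nearly equal size."""
    ids = sorted(set(protein_ids))
    random.Random(seed).shuffle(ids)
    # Striding by k keeps fold sizes within one protein of each other.
    return [ids[i::k] for i in range(k)]


def train_test_splits(folds):
    """Yield (train, test) lists, using each fold once as the test set."""
    for i, test in enumerate(folds):
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        yield train, test
```

Splitting by protein rather than by residue keeps all residues of a chain on the same side of the split, so windows from one chain never appear in both training and test data.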
Sliding window technique
Generation of residue profiles
where $SP_i^k$ and $KD_i^k$ denote the k-th value of $SP_i$ and $KD_i$ for residue i, respectively, and $\overline{SP_i \times KD_i}$ denotes the mean value of vector SP × KD. Note that Equation (4) is an unbiased estimation of the standard deviation. In addition, $SP_i^k$ and $KD_i^k$ represent the same amino acid type; for instance, at the index assigned to alanine, both represent residue 'ALA'.
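One reading of this description is that the integrative value of a residue is the sample standard deviation (denominator n - 1) of the element-wise product of its 20-element sequence profile and the Kyte-Doolittle hydropathy values. The sketch below follows that reading; the dictionary-based input format and the function name `integrative_value` are assumptions, and the profile values would come from HSSP in practice.

```python
from statistics import stdev

# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def integrative_value(profile_row):
    """Standard deviation of profile x hydropathy for one residue.

    profile_row: dict mapping each amino acid letter to the residue's
                 sequence-profile value for that amino acid type.
    """
    products = [profile_row[aa] * KYTE_DOOLITTLE[aa] for aa in KYTE_DOOLITTLE]
    return stdev(products)  # uses the n - 1 denominator
```

Keying both vectors by amino acid letter guarantees that each product pairs values for the same amino acid type, whatever column order the profile source uses.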
The number of positive samples (interface residues) is much smaller than that of negative samples (non-interface residues). Only 27.56% of the samples are interface residues in this work, which leads to a rather imbalanced data distribution. To overcome this problem, the positive and negative training samples are divided into several non-overlapping subsets of roughly the same size, in the order of the ΔASA of the corresponding residues before and after complexation. In the case of the 5-fold cross-validation test, the positive samples are grouped into two subsets in the order of ΔASA, and the negative samples with ΔASA = 0 Å² are randomly grouped into five subsets because only a small number of negative samples have 0 < ΔASA ≤ 1 Å².
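The subset construction can be sketched as below. Splitting the ΔASA-sorted positives into contiguous blocks is one possible reading of "in the order of ΔASA"; that choice, the seed, and all function names are assumptions. Two positive subsets paired with five negative subsets give the ten training sets used for the ten SVMs.

```python
import random
from itertools import product


def split_sorted(samples, n_subsets, key):
    """Sort samples by `key` and cut them into contiguous, near-equal blocks."""
    ordered = sorted(samples, key=key)
    size, extra = divmod(len(ordered), n_subsets)
    subsets, start = [], 0
    for i in range(n_subsets):
        end = start + size + (1 if i < extra else 0)
        subsets.append(ordered[start:end])
        start = end
    return subsets


def split_random(samples, n_subsets, seed=0):
    """Shuffle samples and deal them into near-equal subsets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::n_subsets] for i in range(n_subsets)]


def training_pairs(positives, negatives, key):
    """Pair every positive subset with every negative subset (2 x 5 = 10)."""
    pos_subsets = split_sorted(positives, 2, key)
    neg_subsets = split_random(negatives, 5)
    return list(product(pos_subsets, neg_subsets))
```

With 27.56% positives, half of the positives (about 13.8% of all samples) and a fifth of the negatives (about 14.5%) are of similar size, so each pair is roughly balanced.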
SVMs are accurate classifiers that can also avoid over-fitting [58, 59]. The SVM learner aims to judge whether a residue is located in an interface region or not. As discussed above, there are ten SVMs in the 5-fold test. Here, the input profile vector for each residue is extracted as above, and its target value is labeled as 1 (positive sample) if the residue is located in an interface region and 0 (negative sample) otherwise.
In this study, the SOM technique is adopted to group similar input samples and make them more separable. The purpose of a SOM is to detect regularities and correlations in its input, and also to recognize groups of similar input vectors. It adapts its future responses to that input in such a way that neurons of competitive networks physically near each other in the neuron layer respond to similar input vectors. Readers are referred to Additional file 4 for details. Here, we created SOM networks with N-by-N neurons in a hexagonal layer topology, trained the network on the training set of our dataset for 20 steps, tested proteins from the test dataset, and finally obtained N × N clusters of similar input samples.
Two modifications to the traditional SOM technique are used here, including
Delete the relatively less important nodes associated with a small number of input samples;
Use a validation index to choose clusters with the optimal size of the map.
where n = 1, ..., N indexes the input samples, $w_r$, r = 1, ..., R, denotes the corresponding weight vector, and $U_{rn}$ satisfies $0 \le U_{rn} \le 1$.
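The first modification listed above, removing nodes that attract only a few input samples, can be sketched once a trained map is available. The sketch assumes the trained SOM is given simply as a list of weight vectors and that samples are assigned to the closest weight vector by squared Euclidean distance; the function names and the `min_samples` parameter are hypothetical, and training of the map itself is not shown.

```python
def best_matching_unit(x, weights):
    """Index of the weight vector closest to x (squared Euclidean distance)."""
    return min(
        range(len(weights)),
        key=lambda r: sum((a - b) ** 2 for a, b in zip(x, weights[r])),
    )


def cluster_and_prune(samples, weights, min_samples):
    """Group samples by their best matching unit, then drop sparse nodes."""
    clusters = {}
    for x in samples:
        clusters.setdefault(best_matching_unit(x, weights), []).append(x)
    # Keep only nodes that attracted at least `min_samples` inputs.
    return {r: members for r, members in clusters.items()
            if len(members) >= min_samples}
```

How small a cluster must be before it is discarded is not stated in the text, so `min_samples` is left as a free parameter here.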
Measures for performance evaluation
where TP (True Positive) is the number of true positives, i.e., residues predicted to be interface residues that actually are interface residues; FP (False Positive) is the number of false positives, i.e., residues predicted to be interface residues that are in fact not interface residues; TN (True Negative) is the number of true non-interface residues; and FN (False Negative) is the number of false non-interface residues. The MCC is a measure of how well the predicted class labels correlate with the actual class labels. Its value range is from -1 to 1. An MCC of 1 corresponds to the perfect prediction, while -1 indicates the worst possible prediction; an MCC of 0 corresponds to a random guess.
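The six measures used in this work follow from the four counts defined above. The sketch below uses the standard textbook definitions of sensitivity, precision, specificity, accuracy, F1, and MCC; the function name `evaluate` is hypothetical, and the sketch does not guard against empty classes, which would cause a division by zero.

```python
from math import sqrt


def evaluate(tp, fp, tn, fn):
    """Compute the six evaluation measures from a confusion matrix."""
    sensitivity = tp / (tp + fn)          # recall on interface residues
    precision = tp / (tp + fp)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    mcc = (tp * tn - fp * fn) / sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "sensitivity": sensitivity,
        "precision": precision,
        "specificity": specificity,
        "accuracy": accuracy,
        "F1": f1,
        "MCC": mcc,
    }
```

Because F1 is the harmonic mean of precision and sensitivity, the reported values can be cross-checked: a sensitivity of 31.39% and a precision of 81.12% give an F1 of about 45.27%, as stated in the results.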
This work was supported by the Singapore MOE ARC Tier-2 funding grant T208B2203.
- Alberts BD, Lewis J, Raff M, Roberts K, Watson JD: Molecular Biology of the Cell. 2nd edition. New York: Garland; 1989.
- Bollenbach TJ, Nowak T: Kinetic Linked-Function Analysis of the Multiligand Interactions on Mg2+-Activated Yeast Pyruvate Kinase. Biochemistry 2001, 40(43):13097–13106. 10.1021/bi010126o
- Chelliah V, Chen L, Blundell TL, Lovell SC: Distinguishing structural and functional restraints in evolution in order to identify interaction sites. J Mol Biol 2004, 342: 1487–1504. 10.1016/j.jmb.2004.08.022
- Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN: The Protein Data Bank. Nucleic Acids Res 2000, 28: 235–242. 10.1093/nar/28.1.235
- Uni-Prot-Consortium: The universal protein resource (UniProt). Nucleic Acids Res 2008, 36: D190-D195. 10.1093/nar/gkn141
- Kini RM, Evans HJ: Prediction of potential protein-protein interaction sites from amino acid sequence identification of a fibrin polymerization site. FEBS Lett 1996, 385: 81–86. 10.1016/0014-5793(96)00327-4
- Jones S, Thornton JM: Prediction of protein-protein interaction sites using patch analysis. J Mol Biol 1997, 272: 133–143. 10.1006/jmbi.1997.1233
- Jones S, Thornton JM: Analysis of protein-protein interaction sites using surface patches. J Mol Biol 1997, 272: 121–132. 10.1006/jmbi.1997.1234
- Murakami Y, Jones S: SHARP2: protein-protein interaction predictions using patch analysis. Bioinformatics 2006, 22: 1794–5. 10.1093/bioinformatics/btl171
- Janin J: Specific vs. non-specific contacts in protein crystals. Nat Struct Biol 1997, 4: 973–974. 10.1038/nsb1297-973
- Thorn KS, Bogan AA: ASEdb: a database of alanine mutations and their effects on the free energy of binding in protein interactions. Bioinformatics 2001, 17: 284–285. 10.1093/bioinformatics/17.3.284
- Pazos F, Valencia A: In silico two hybrid system for the selection of physically interacting protein pairs. Proteins 2002, 47: 219–227. 10.1002/prot.10074
- Zhou H, Shan Y: Prediction of protein interaction sites from sequence profile and residue neighbor list. Proteins 2001, 44: 336–343. 10.1002/prot.1099
- Fariselli P, Pazos F, Valencia A, Casadio R: Prediction of protein-protein interaction sites in heterocomplexes with neural networks. Eur J Biochem 2002, 269: 1356–1361. 10.1046/j.1432-1033.2002.02767.x
- Ofran Y, Rost B: Predicted protein-protein interaction sites from local sequence information. FEBS Lett 2003, 544: 236–239. 10.1016/S0014-5793(03)00456-3
- Res I, Mihalek I, Lichtarge O: An evolution based classifier for prediction of protein interfaces without using protein structures. Bioinformatics 2005, 21: 2496–2501. 10.1093/bioinformatics/bti340
- Wang B, Chen P, Huang DS, Li JJ, Lok TM, et al.: Predicting protein interaction sites from residue spatial sequence profile and evolution rate. FEBS Lett 2006, 580: 380–384. 10.1016/j.febslet.2005.11.081
- Gallet X, Charloteaux B, Thomas A, Brasseur R: A fast method to predict protein interaction sites from sequences. J Mol Biol 2000, 302: 917–926. 10.1006/jmbi.2000.4092
- Bradford JR, Westhead DR: Improved prediction of protein-protein binding sites using a support vector machines approach. Bioinformatics 2005, 21: 1487–94. 10.1093/bioinformatics/bti242
- Bordner AJ, Abagyan R: Statistical analysis and prediction of protein-protein interfaces. Proteins 2005, 60: 353–66. 10.1002/prot.20433
- Chung J, Wang W, Bourne PE: Exploiting sequence and structure homologs to identify protein-protein binding sites. Proteins 2006, 62: 630–40. 10.1002/prot.20741
- Dong Q, Wang X, Lin L, Guan Y: Exploiting residue-level and profile-level interface propensities for usage in binding sites prediction of proteins. BMC Bioinformatics 2007, 8: 147. 10.1186/1471-2105-8-147
- Chen H, Zhou H: Prediction of interface residues in protein-protein complexes by a consensus neural network method: test against NMR data. Proteins 2005, 61: 21–35. 10.1002/prot.20514
- Ofran Y, Rost B: ISIS: interaction sites identified from sequence. Bioinformatics 2007, 23: 13–6. 10.1093/bioinformatics/btl303
- Wang B, Ge LS, Jia WY, Liu L, Chen FC: Prediction of protein interactions by combining genetic algorithm with SVM method. Evolutionary Computation, 2007. CEC 2007. IEEE Congress on 2007, 320–325.
- Du X, Cheng J, Song J: Improved Prediction of Protein Binding Sites from Sequences Using Genetic Algorithm. The Protein Journal 2009, 28(6):273–280. 10.1007/s10930-009-9192-1
- Friedrich T, Pils B, Dandekar T, et al.: Modelling interaction sites in protein domains with interaction profile hidden Markov models. Bioinformatics 2006, 22: 2851–7. 10.1093/bioinformatics/btl486
- Neuvirth H, Raz R, Schreiber G: ProMate: a structure based prediction program to identify the location of protein-protein binding sites. J Mol Biol 2004, 338: 181–99. 10.1016/j.jmb.2004.02.040
- Bradford JR, Needham CJ, Bulpitt AJ: Insights into protein-protein interfaces using a Bayesian network prediction method. J Mol Biol 2006, 362: 365–86. 10.1016/j.jmb.2006.07.028
- Chen XW, Jeong JC: Sequence-based prediction of protein interaction sites with an integrative method. Bioinformatics 2009, 25(5):585–591. 10.1093/bioinformatics/btp039
- Sikic M, Tomic S, Vlahovicek K: Prediction of Protein-Protein Interaction Sites in Sequences and 3D Structures by Random Forests. PLoS Comput Biol 2009, 5(1):e1000278. 10.1371/journal.pcbi.1000278
- Glaser F, Steinberg DM, Vakser IA, et al.: Residue frequencies and pairing preferences at protein-protein interfaces. Proteins 2001, 43: 89–102. 10.1002/1097-0134(20010501)43:2<89::AID-PROT1021>3.0.CO;2-H
- Guharoy M, Chakrabarti P: Conservation and relative importance of residues across protein-protein interfaces. PNAS 2005, 102: 15447–52. 10.1073/pnas.0505425102
- Ezkurdia I, Bartoli L, Fariselli P, Casadio R, Valencia A, Tress ML: Progress and challenges in predicting protein-protein interaction sites. Briefings in Bioinformatics 2009, 10(3):233–246. 10.1093/bib/bbp021
- Porollo A, Meller J: Prediction-based fingerprints of protein-protein interactions. Proteins 2007, 66: 630–45. 10.1002/prot.21248
- Laskowski RA: SURFNET: A program for visualizing molecular surfaces, cavities, and intermolecular interactions. J Mol Graph 1995, 13: 323–330. 10.1016/0263-7855(95)00073-9
- Jones S, Thornton JM: Principles of proteinprotein interactions. Proc Natl Acad Sci USA 1996, 93: 13–20. 10.1073/pnas.93.1.13View ArticlePubMedPubMed CentralGoogle Scholar
- Bahadur RP, Chakrabarti P, Rodier F, Janin J: A dissection of specific and non-specific protein-protein interfaces. J Mol Biol 2004, 336: 943–955. 10.1016/j.jmb.2003.12.073View ArticlePubMedGoogle Scholar
- Chakrabarti P, Janin J: Dissecting protein-protein recognition sites. Proteins 2002, 47: 334–343. 10.1002/prot.10085View ArticlePubMedGoogle Scholar
- Bahadur RP, Chakrabarti P, Rodier F, Janin J: Dissecting subunit interfaces in homodimeric proteins. Proteins 2003, 53: 708–719. 10.1002/prot.10461View ArticlePubMedGoogle Scholar
- Singh R, Xu J, Berger B: Struct2net: integrating structure into protein-protein interaction prediction. Pac Symp Biocomput 2006, 11: 403–14.
- Kohonen T: Self-Organizing Maps. 2nd edition. Heidelberg: Springer; 1997.
- Ofran Y, Rost B: Analysing six types of protein-protein interfaces. J Mol Biol 2003, 325: 377–387. 10.1016/S0022-2836(02)01223-8
- Kauzmann W: Some factors in the interpretation of protein denaturation. Adv Protein Chem 1959, 14: 1–63.
- Lo Conte L, Chothia C, Janin J: The atomic structure of protein-protein recognition sites. J Mol Biol 1999, 285: 2177–2198. 10.1006/jmbi.1998.2439
- Sander C, Schneider R: Database of homology derived protein structures and the structural meaning of sequence alignment. Proteins 1991, 9: 56–68. 10.1002/prot.340090107
- Kyte J, Doolittle R: A simple method for displaying the hydropathic character of a protein. J Mol Biol 1982, 157: 105–132. 10.1016/0022-2836(82)90515-0
- Hansen LK, Salamon P: Neural network ensembles. IEEE Trans Pattern Anal Mach Intell 1990, 12: 993–1001. 10.1109/34.58871
- Kittler J, Alkoot FM: Sum versus vote fusion in multiple classifier systems. IEEE Trans Pattern Anal Mach Intell 2003, 25: 110–115. 10.1109/TPAMI.2003.1159950
- Kuncheva LI: Combining pattern classifiers: methods and algorithms. U.S.: Wiley; 2004.
- Cherepanov P, Ambrosio ALB, Rahman S, Ellenberger T, Engelman A: Structural basis for the recognition between HIV-1 integrase and transcriptional coactivator p75. PNAS 2005, 102(48):17308–17313. 10.1073/pnas.0506924102
- Cherepanov P, Devroe E, Silver PA, Engelman A: Identification of an evolutionarily conserved domain in human lens epithelium-derived growth factor/transcriptional co-activator p75 (LEDGF/p75) that binds HIV-1 integrase. J Biol Chem 2004, 279: 48883–48892. 10.1074/jbc.M406307200
- Baldi P, Brunak S: Bioinformatics: The machine learning approach. London, England: The MIT Press; 2000.
- Levy ED, Pereira-Leal JB, Chothia C, Teichmann SA: 3D complex: a structural classification of protein complexes. PLoS Comput Biol 2006, 2(11):e155. 10.1371/journal.pcbi.0020155
- Mihel J, Sikic M, Tomic S, Jeren B, Vlahovicek K: PSAIA-Protein Structure and Interaction Analyzer. BMC Struct Biol 2008, 8: 21. 10.1186/1472-6807-8-21
- Larsen TA, Olson AJ, Goodsell DS: Morphology of protein-protein interfaces. Structure 1998, 6: 421–7. 10.1016/S0969-2126(98)00044-6
- Charton M, Charton BI: The structural dependence of amino acid hydrophobicity parameters. J Theor Biol 1982, 99: 629–644. 10.1016/0022-5193(82)90191-6
- Cortes C, Vapnik V: Support-Vector Networks. Machine Learning 1995, 20: 273–297.
- Chen P, Wang B, Wong HS, Huang DS: Prediction of protein B-factors using multi-class bounded SVM. Protein and Peptide Letters 2007, 14(2):185–190. 10.2174/092986607779816078
- Bezdek JC, Ehrlich R, Full W: FCM: fuzzy c-means algorithm. Comput Geosci 1984, 10(2–3):191–203. 10.1016/0098-3004(84)90020-7
- Pascual-Marqui RD, Pascual-Montano AD, Kochi K, Carazo JM: Smoothly distributed fuzzy c-means: a new self-organizing map. Pattern Recognition 2001, 34: 2395–2402. 10.1016/S0031-3203(00)00167-9
- Wong HS, Ma B, Sha Y, Ip HHS: 3D head model retrieval in kernel feature space using HSOM. Pattern Recognition 2008, 41: 468–483. 10.1016/j.patcog.2007.06.009
- de Vries SJ, Bonvin AM: How proteins get in touch: interface prediction in the study of biomolecular complexes. Curr Protein Pept Sci 2008, 9(4):394–406. 10.2174/138920308785132712
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.