Predicting the protein-protein interactions using primary structures with predicted protein surface
© Chang et al; licensee BioMed Central Ltd. 2010
Published: 18 January 2010
Many biological functions involve various protein-protein interactions (PPIs). Elucidating such interactions is crucial for understanding general principles of cellular systems. Previous studies have shown the potential of predicting PPIs based on only sequence information. Compared to approaches that require other auxiliary information, these sequence-based approaches can be applied to a broader range of applications.
This study presents a novel sequence-based method based on the assumption that protein-protein interactions are more related to amino acids at the surface than those at the core. The present method considers surface information and maintains the advantage of relying on only sequence data by including an accessible surface area (ASA) predictor recently proposed by the authors. This study also reports the experiments conducted to evaluate a) the performance of PPI prediction achieved by including the predicted surface and b) the quality of the predicted surface in comparison with the surface obtained from structures. The experimental results show that surface information helps to predict interacting protein pairs. Furthermore, the prediction performance achieved by using the surface estimated with the ASA predictor is close to that using the surface obtained from protein structures.
This work presents a sequence-based method that takes surface information into account when predicting PPIs. The proposed surface-identification procedure improves the F-measure of PPI prediction by 5.1 percentage points (from 70.4% to 75.5%). The extracted surfaces are also valuable in other biomedical applications that require similar information.
The different types of interactions among proteins are essential to various biological functions in a living cell. Information about these interactions provides a basis for constructing protein interaction networks and improves our understanding of the general principles by which biological systems function. Recent years have seen the development of various experimental techniques for systematic protein-protein interaction (PPI) analysis [2–5]. At present, however, experimentally detected interactions represent only a small fraction of the real interaction network [6, 7]. Therefore, a number of computational approaches have been proposed to expedite PPI detection, which would otherwise rely on experimental techniques alone.
Computational methods that depend not only on sequence information but also on prior knowledge, such as localization data, structural data [10, 11], expression data [12, 13] or information on the interactions of orthologs [14, 15], cannot be applied to some essential proteins that are observed in most organisms. To solve this problem, several sequence-based algorithms have been developed to detect potentially interacting protein pairs when no auxiliary information is available [17–23].
This work presents a novel sequence-based method which involves a mechanism for identifying the protein surface to help PPI prediction. This method employs the conjoint triad feature for describing protein sequences and the relaxed variable kernel density estimator (RVKDE) for classification. Conjoint triads, which treat three continuous amino acids as a single unit, have been shown to be a useful set of features in predicting protein-protein interactions. This work improves this feature set by focusing on conjoint triads at the protein surface. This improvement is based on the assumption that protein-protein interactions are more related to amino acids at the surface than those at the core. To maintain the advantage of depending on only sequence information, this method employs an accurate accessible surface area (ASA) predictor, recently proposed by the authors, to determine the protein surface.
In this study, a collection of 691 PPIs is used to evaluate the prediction performance with and without the proposed mechanism for identifying the protein surface. The experimental results show that the surface information promotes PPI prediction based on feature encoding with conjoint triads. Furthermore, the quality of the predicted surface is analyzed using a number of protein structures collected from the Protein Data Bank (PDB). The experimental results demonstrate that the performance of PPI prediction achieved using the predicted surface is close to that achieved using the surface obtained from protein structures.
Results and discussion
This section first describes the workflow of the proposed method. Next, the measurements and datasets for performance evaluation are presented. The proposed method is evaluated and compared with another sequence-based PPI predictor. At the end of the section, the predicted surface is compared to those obtained from protein structures.
Proposed PPI prediction scheme
A challenge in preparing protein-protein interaction datasets is the presence of some interactions that are observed in laboratory experiments but do not occur physiologically. To ensure the quality of PPI data, an interaction should be consistent with other types of information, such as metabolomic and gene-gene relationship data. Although these types of data are incomplete for most organisms at present, the interaction network of transcription factors (TFs) of Saccharomyces cerevisiae is an extensively studied system for which all such information is currently available. Therefore, this study collects 691 interactions of 211 yeast TFs from several studies and databases [32–36] to generate a PPI dataset, SC691. In this dataset, the 691 interactions are used as positive instances, while other protein pairs created by coupling the 211 TFs are used as negative instances.
Evaluation of PPI prediction
In the experiment, the SC691 dataset is randomly split into three subsets of 341, 175 and 175 interacting pairs. These subsets also contain 341, 175 and 175 non-interacting pairs obtained by randomly sampling the negative instances in the SC691 dataset. Care is taken to ensure that different subsets do not share identical instances. The first subset is used as the training set to predict the other two subsets. The predicted results of the second subset are used for parameter selection, while the predicted results of the third subset indicate the prediction performance of a PPI predictor. An evaluation process is therefore performed by first using the first subset to predict the second subset; the parameters that maximize the F-measure are then used to predict the third subset. Since the procedure for generating these subsets involves randomness, the evaluation process is performed ten times to reduce the bias of any single evaluation run.
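The splitting step described above can be sketched as follows. This is a minimal illustration and not the authors' code: the function name `split_balanced`, the dictionary layout and the use of a seed per repetition are assumptions made here for clarity.

```python
import random

def split_balanced(positives, negatives, sizes=(341, 175, 175), seed=0):
    """Split positive pairs into disjoint subsets and pair each subset
    with an equal number of randomly sampled negative pairs."""
    rng = random.Random(seed)
    pos = list(positives)
    rng.shuffle(pos)
    # Sampling without replacement keeps the subsets free of shared instances.
    neg = rng.sample(list(negatives), sum(sizes))
    subsets, start = [], 0
    for size in sizes:
        subsets.append({"pos": pos[start:start + size],
                        "neg": neg[start:start + size]})
        start += size
    return subsets  # [training, parameter-selection, test]
```

Calling the function with ten different seeds would give the ten repetitions mentioned in the text; in each repetition the parameters chosen on the second subset are then applied to the third.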
Performance achieved by considering and by neglecting surface information (all values in %).
                                        Acc.         Fm.          Prec.        Sens.        Spec.
Without surface information
  Shen et al.'s work                    68.2 ± 4.3   70.4 ± 3.2   66.4 ± 5.1   75.4 ± 5.4   61.0 ± 10.2
Surface identified using different o
                                        72.3 ± 1.4   73.7 ± 1.6   70.3 ± 2.3   77.8 ± 4.4   66.9 ± 5.1
                                        72.1 ± 3.2   74.0 ± 2.2   69.7 ± 4.2   79.3 ± 3.7   64.9 ± 8.3
                                        74.1 ± 2.0   75.5 ± 2.0   71.8 ± 2.4   79.7 ± 3.5   68.6 ± 4.0
                                        71.7 ± 3.8   73.4 ± 2.3   69.8 ± 4.9   77.9 ± 5.9   65.4 ± 11.5
As a result, the average Acc., Fm., Prec., Sens. and Spec. of the developed method are 74.1%, 75.5%, 71.8%, 79.7% and 68.6%, respectively. All five measurements are superior to those delivered by the predictor without surface information. These results show that the proposed mechanism for identifying the protein surface helps to predict protein-protein interactions based on feature encoding with conjoint triads.
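The five measurements can be computed from the confusion counts of a single evaluation run. The sketch below uses the standard textbook definitions of accuracy, precision, sensitivity, specificity and F-measure; these are assumed here, since the formulas themselves are not reproduced in this text, and the function name `binary_metrics` is hypothetical.

```python
def binary_metrics(tp, fp, tn, fn):
    """Return Acc., Fm., Prec., Sens. and Spec. as fractions in [0, 1].

    tp/fn count interacting pairs predicted correctly/incorrectly;
    tn/fp count non-interacting pairs predicted correctly/incorrectly.
    """
    acc = (tp + tn) / (tp + fp + tn + fn)
    prec = tp / (tp + fp) if tp + fp else 0.0
    sens = tp / (tp + fn) if tp + fn else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0
    # F-measure: harmonic mean of precision and sensitivity.
    fm = 2 * prec * sens / (prec + sens) if prec + sens else 0.0
    return {"Acc": acc, "Fm": fm, "Prec": prec, "Sens": sens, "Spec": spec}
```

For a test subset of 175 interacting and 175 non-interacting pairs, `binary_metrics(tp=140, fp=55, tn=120, fn=35)` would describe a run in which 140 interactions are recovered; averaging such results over the ten runs gives values of the kind reported above.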
Evaluation of predicted surface
Proteins in the SC691 dataset that have structures in PDB
PDB ID: chain
Transcription initiation protein
Galactose/lactose metabolism regulatory protein
RNA polymerase II mediator complex subunit 18
RNA polymerase II mediator complex subunit 20
RNA polymerase II holoenzyme component SRB7
Mitochondrial replication protein
Nonhistone protein 6A
Cyclin, negatively regulates phosphate metabolism
Transcription initiation factor IIA large chain
Transcription initiation factor IIA small chain
Overlap between predicted and structural surface.
Acc.         Fm.          Prec.        Sens.        Spec.
85.8 ± 8.4   88.3 ± 7.0   85.7 ± 9.7   91.9 ± 9.0   77.7 ± 15.2
Performance achieved using predicted and structural surface.
Acc.         Fm.          Prec.        Sens.        Spec.
96.1 ± 0.6   39.8 ± 1.7   58.5 ± 11.0  31.1 ± 3.5   98.9 ± 0.7
96.2 ± 0.2   40.7 ± 1.3   58.1 ± 6.0   31.5 ± 1.3   99.0 ± 0.3
For Med18, the present method excludes 80 of the 307 residues (~26.1%) while preserving 48 of the 52 interface residues (~92.3%). As shown in Figure 2(a), most interface residues, specified in yellow, are included. For Med20, however, the predicted surface in Figure 2(b) misses 24 of the 44 interface residues (~54.5%). Figure 2(b) reveals that the predicted surface misses the segment (residues 86-107) of Med20 that acts like an arm stretching to Med18. A comparison with the interface shown in Figure 2(a) suggests that the present method may perform better at handling flatter interfaces. Since protein subunits may interact and form relatively flat or twisted surfaces, the good performance of the present method probably results from the fact that most of the collected S. cerevisiae TFs have relatively flat surfaces.
These results also reveal that the proposed mechanism for identifying the surfaces of proteins with relatively twisted surfaces must be improved.
An enormous gap exists between the number of protein structures and the huge number of protein sequences. Hence, predicting protein functions directly from amino acid sequences remains one of the most important problems in life science. This work presents a computational approach for PPI prediction based on only sequence information. Notably, a mechanism for extracting surface information is proposed to refine the feature vector that represents a protein sequence. This method is analyzed in terms of a) the performance in predicting PPIs and b) the quality of the predicted surface. The experimental results show that the present method improves the F-measure of PPI prediction by 5.1 percentage points. Furthermore, the predicted surface of yeast TFs is consistent with that obtained from structures, which encourages applying the present steps of surface identification to other biomedical problems that require similar information.
The second stage encodes a protein sequence based on neighboring solvent accessibility [26, 45]. The i-th residue in a protein sequence is represented as a $(2w_2+1)$-dimensional vector $v = (a_{i-h}, t_{i-h}, a_{i-h+1}, t_{i-h+1}, \ldots, a_i, t_i, \ldots, a_{i+h}, t_{i+h}, l)$, where $a_i$ is the predicted RSA value of the i-th residue in the first regression, $t_i$ is the terminal flag, set to 1 for a null/terminal residue and 0 otherwise, $l$ is the sequence length and $w_2 = 2h+1$ is the window size ($w_2 = 5$ in our implementation).
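The windowed encoding above can be illustrated with a short sketch. It assumes that positions falling outside the sequence are filled with a placeholder RSA of 0.0 and a terminal flag of 1; the placeholder value and the function name `encode_second_stage` are assumptions, as the text does not specify them.

```python
def encode_second_stage(rsa, h=2):
    """Build one (2*(2h+1)+1)-dimensional vector per residue from
    first-stage RSA predictions (h=2 gives a window of 5, i.e. 11 dimensions)."""
    n = len(rsa)
    vectors = []
    for i in range(n):
        v = []
        for j in range(i - h, i + h + 1):
            if 0 <= j < n:
                v.extend([rsa[j], 0])   # real residue: predicted RSA, flag 0
            else:
                v.extend([0.0, 1])      # null position beyond a terminus (assumed padding)
        v.append(n)                     # sequence length l
        vectors.append(v)
    return vectors
```

With `rsa = [0.1, 0.5, 0.9]`, the vector for the first residue starts with two padded positions, followed by the three real residues and the length 3.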
In its standard form, the SVR prediction function is $f(x) = \sum_i w_i K(x_i, x) + b$, where $K(\cdot,\cdot)$ is the kernel function, $x_i$ are the support vectors, and $b$ and $w_i$ are numerical parameters determined by minimizing the prediction error on training samples. The problem is to find the support vectors and determine parameters $b$ and $w_i$, which can be solved by constrained quadratic optimization. The LIBSVM package (version 2.86) is used for SVR implementation in this study.
The employed ASA predictor makes predictions at the residue level. The predicted RSA value of each residue enables surface residues to be defined as those whose RSA values are equal to or larger than a threshold t. These identified surface residues are frequently scattered throughout the protein sequences. This work develops a process for generating a set of surface segments each of which is a consecutive sub-sequence of minimum length. Because a conjoint triad represents three continuous amino acids, these consecutive segments are more suitable than scattered surface residues for being encoded with conjoint triads.
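One possible reading of this segment-generation step is sketched below: residues whose predicted RSA is at least t are marked as surface, and only runs of consecutive surface residues that reach a minimum length are kept. The default threshold `t=0.25`, the minimum length of 3 (chosen so that every segment contains at least one triad) and the function name are assumptions for illustration, not values taken from the paper.

```python
def surface_segments(sequence, rsa, t=0.25, min_len=3):
    """Return consecutive sub-sequences whose residues all have RSA >= t
    and whose length is at least min_len."""
    if len(sequence) != len(rsa):
        raise ValueError("sequence and rsa must have the same length")
    segments, start = [], None
    for i, value in enumerate(rsa):
        if value >= t:
            if start is None:
                start = i               # a new surface run begins
        elif start is not None:
            if i - start >= min_len:    # keep the run only if it is long enough
                segments.append(sequence[start:i])
            start = None
    if start is not None and len(rsa) - start >= min_len:
        segments.append(sequence[start:])  # run reaching the C-terminus
    return segments
```

Isolated surface residues are dropped under this reading, which matches the remark that scattered residues are poorly suited to triad encoding.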
Amino acid groups used herein.
Ala, Gly, Val
Ile, Leu, Phe, Pro
Tyr, Met, Thr, Ser
His, Asn, Gln, Trp
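A conjoint triad encoding over such groups can be sketched as follows. The sketch assumes the seven-class grouping of Shen et al.: the first four classes are the groups listed above, while the last three (Arg/Lys, Asp/Glu and Cys) are taken from that paper rather than from this table. The min-max normalization is likewise borrowed from Shen et al. and is an assumption here; all names are hypothetical.

```python
from itertools import product

# One-letter codes; classes 5-7 are assumed from Shen et al.
GROUPS = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
CLASS_OF = {aa: idx for idx, members in enumerate(GROUPS) for aa in members}
TRIADS = list(product(range(len(GROUPS)), repeat=3))  # 7^3 = 343 triad types

def triad_frequencies(segments):
    """Count each class triad over a list of sequence segments."""
    counts = dict.fromkeys(TRIADS, 0)
    for seg in segments:
        classes = [CLASS_OF.get(aa) for aa in seg]
        for a, b, c in zip(classes, classes[1:], classes[2:]):
            if None not in (a, b, c):   # skip triads with unknown residues
                counts[(a, b, c)] += 1
    return [counts[t] for t in TRIADS]

def normalize(freqs):
    """Min-max scale the counts to [0, 1] to reduce sequence-length effects."""
    lo, hi = min(freqs), max(freqs)
    if hi == lo:
        return [0.0] * len(freqs)
    return [(f - lo) / (hi - lo) for f in freqs]
```

Passing the whole sequence as a single segment gives the encoding without surface information, while passing only surface segments restricts the counts to triads at the predicted surface. A protein pair could then be represented by concatenating the two 343-dimensional vectors.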
Relaxed variable kernel density estimator
$R(s_i)$ is the maximum distance between $s_i$ and its $k_s$ nearest training instances;
$\Gamma(\cdot)$ is the Gamma function;
$\beta$ and $k_s$ are parameters to be set either through cross-validation or by the user.
where $|S_j|$ is the number of class-j training instances, and $\hat{f}_j(v)$ is the kernel density estimator corresponding to class-j training instances. In this study, j is either 'interacting' or 'non-interacting'. To improve the efficiency of the predictor, the current RVKDE implementation includes only a limited number, denoted by $k_t$, of the nearest class-j training instances of v while computing $\hat{f}_j(v)$. The $k_t$ is also a parameter to be set either through cross-validation or by the user.
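The general idea of classifying with per-class kernel density estimates can be sketched as below. This is not the published RVKDE formula, which is not reproduced in this text: it is a simplified variable-bandwidth Gaussian estimator in which each training instance receives the bandwidth $\beta R(s_i)$, neighbours are searched within the same class, and only the $k_t$ nearest class-j instances contribute to a query. All function names are hypothetical.

```python
import math

def bandwidths(instances, k_s, beta):
    """sigma_i = beta * R(s_i), with R(s_i) taken as the distance from s_i
    to its k_s-th nearest neighbour in the same class (assumed simplification).
    Requires at least two instances per class."""
    sigmas = []
    for i, s in enumerate(instances):
        dists = sorted(math.dist(s, o) for j, o in enumerate(instances) if j != i)
        r = dists[min(k_s, len(dists)) - 1]
        sigmas.append(max(beta * r, 1e-9))  # guard against zero bandwidth
    return sigmas

def class_score(v, instances, sigmas, k_t):
    """Unnormalized density: sum of Gaussian kernels over the k_t
    class instances nearest to v (equals |S_j| times the class density)."""
    m = len(v)
    order = sorted(range(len(instances)), key=lambda i: math.dist(v, instances[i]))
    total = 0.0
    for i in order[:k_t]:
        d2 = math.dist(v, instances[i]) ** 2
        norm = (1.0 / (math.sqrt(2 * math.pi) * sigmas[i])) ** m
        total += norm * math.exp(-d2 / (2 * sigmas[i] ** 2))
    return total

def classify(v, training, k_s=2, k_t=3, beta=1.0):
    """Return the class label with the largest score for query vector v."""
    scores = {}
    for label, instances in training.items():
        sigmas = bandwidths(instances, k_s, beta)
        scores[label] = class_score(v, instances, sigmas, k_t)
    return max(scores, key=scores.get)
```

For high-dimensional inputs such as concatenated triad vectors, the kernel terms would need to be evaluated in log space to avoid numerical underflow; the sketch omits this for brevity.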
The authors would like to thank the National Science Council of the Republic of China, Taiwan, for financially supporting this research under Contract Nos. NSC 97-2627-P-001-002, NSC 96-2320-B-006-027-MY2 and NSC 96-2221-E-006-232-MY2. Ted Knoy is appreciated for his editorial assistance.
This article has been published as part of BMC Bioinformatics Volume 11 Supplement 1, 2010: Selected articles from the Eighth Asia-Pacific Bioinformatics Conference (APBC 2010). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/11?issue=S1.
- Ge H, Walhout AJM, Vidal M: Integrating 'omic' information: a bridge between genomics and systems biology. Trends Genet 2003, 19(10):551–560. 10.1016/j.tig.2003.08.009
- Ito T, Chiba T, Ozawa R, Yoshida M, Hattori M, Sakaki Y: A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc Natl Acad Sci USA 2001, 98(8):4569–4574. 10.1073/pnas.061034498
- Ho Y, Gruhler A, Heilbut A, Bader GD, Moore L, Adams SL, Millar A, Taylor P, Bennett K, Boutilier K, et al.: Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature 2002, 415(6868):180–183. 10.1038/415180a
- Gavin AC, Aloy P, Grandi P, Krause R, Boesche M, Marzioch M, Rau C, Jensen LJ, Bastuck S, Dumpelfeld B, et al.: Proteome survey reveals modularity of the yeast cell machinery. Nature 2006, 440(7084):631–636. 10.1038/nature04532
- Tong AHY, Drees B, Nardelli G, Bader GD, Brannetti B, Castagnoli L, Evangelista M, Ferracuti S, Nelson B, Paoluzi S, et al.: A combined experimental and computational strategy to define protein interaction networks for peptide recognition modules. Science 2002, 295(5553):321–324. 10.1126/science.1064987
- Han JDJ, Dupuy D, Bertin N, Cusick ME, Vidal M: Effect of sampling on topology predictions of protein-protein interaction networks. Nat Biotechnol 2005, 23(7):839–844. 10.1038/nbt1116
- Hart GT, Ramani AK, Marcotte EM: How complete are current yeast and human protein-interaction networks? Genome Biol 2006, 7(11):120. 10.1186/gb-2006-7-11-120
- Shoemaker BA, Panchenko AR: Deciphering protein-protein interactions. Part II. Computational methods to predict protein and domain interaction partners. PLoS Comput Biol 2007, 3(4):e43. 10.1371/journal.pcbi.0030043
- Pellegrini M, Marcotte EM, Thompson MJ, Eisenberg D, Yeates TO: Assigning protein functions by comparative genome analysis: Protein phylogenetic profiles. Proc Natl Acad Sci USA 1999, 96(8):4285–4288. 10.1073/pnas.96.8.4285
- Aloy P, Russell RB: InterPreTS: protein Interaction Prediction through Tertiary Structure. Bioinformatics 2003, 19(1):161–162. 10.1093/bioinformatics/19.1.161
- Ogmen U, Keskin O, Aytuna AS, Nussinov R, Gursoy A: PRISM: protein interactions by structural matching. Nucleic Acids Res 2005, 33:W331-W336. 10.1093/nar/gki585
- Enright AJ, Iliopoulos I, Kyrpides NC, Ouzounis CA: Protein interaction maps for complete genomes based on gene fusion events. Nature 1999, 402(6757):86–90. 10.1038/47056
- Marcotte EM, Pellegrini M, Ng HL, Rice DW, Yeates TO, Eisenberg D: Detecting protein function and protein-protein interactions from genome sequences. Science 1999, 285(5428):751–753. 10.1126/science.285.5428.751
- Huang TW, Tien AC, Lee YCG, Huang WS, Lee YCG, Peng CL, Tseng HH, Kao CY, Huang CYF: POINT: a database for the prediction of protein-protein interactions based on the orthologous interactome. Bioinformatics 2004, 20(17):3273–3276. 10.1093/bioinformatics/bth366
- Espadaler J, Romero-Isart O, Jackson RM, Oliva B: Prediction of protein-protein interactions using distant conservation of sequence patterns and structure relationships. Bioinformatics 2005, 21(16):3360–3368. 10.1093/bioinformatics/bti522
- Valencia A, Pazos F: Computational methods for the prediction of protein interactions. Curr Opin Struct Biol 2002, 12(3):368–373. 10.1016/S0959-440X(02)00333-0
- Ben-Hur A, Noble WS: Kernel methods for predicting protein-protein interactions. Bioinformatics 2005, 21:I38-I46. 10.1093/bioinformatics/bti1016
- Chen XW, Liu M: Prediction of protein-protein interactions using random decision forest framework. Bioinformatics 2005, 21(24):4394–4400. 10.1093/bioinformatics/bti721
- Martin S, Roe D, Faulon JL: Predicting protein-protein interactions using signature products. Bioinformatics 2005, 21(2):218–226. 10.1093/bioinformatics/bth483
- Chou KC, Cai YD: Predicting protein-protein interactions from sequences in a hybridization space. J Proteome Res 2006, 5(2):316–322. 10.1021/pr050331g
- Pitre S, Dehne F, Chan A, Cheetham J, Duong A, Emili A, Gebbia M, Greenblatt J, Jessulat M, Krogan N, et al.: PIPE: a protein-protein interaction prediction engine based on the re-occurring short polypeptide sequences between known interacting protein pairs. BMC Bioinformatics 2006, 7:365. 10.1186/1471-2105-7-365
- Shen JW, Zhang J, Luo XM, Zhu WL, Yu KQ, Chen KX, Li YX, Jiang HL: Predicting protein-protein interactions based only on sequences information. Proc Natl Acad Sci USA 2007, 104(11):4337–4341. 10.1073/pnas.0607879104
- Guo YZ, Yu LZ, Wen ZN, Li ML: Using support vector machine combined with auto covariance to predict protein-protein interactions from protein sequences. Nucleic Acids Res 2008, 36(9):3025–3030. 10.1093/nar/gkn159
- Shen JW, Zhang J, Luo XM, Zhu WL, Yu KQ, Chen KX, Li YX, Jiang HL: Predicting protein-protein interactions based only on sequences information. Proceedings of the National Academy of Sciences of the United States of America 2007, 104(11):4337–4341. 10.1073/pnas.0607879104
- Oyang YJ, Hwang SC, Ou YY, Chen CY, Chen ZW: Data classification with radial basis function networks based on a novel kernel density estimation algorithm. IEEE Transactions on Neural Networks 2005, 16(1):225–236. 10.1109/TNN.2004.836229
- Chang DTH, Huang HY, Syu YT, Wu CP: Real value prediction of protein solvent accessibility using enhanced PSSM features. BMC Bioinformatics 2008, 9(Suppl 12):S12. 10.1186/1471-2105-9-432
- Kirchmair J, Markt P, Distinto S, Schuster D, Spitzer GM, Liedl KR, Langer T, Wolber G: The Protein Data Bank (PDB), Its Related Services and Software Tools as Key Components for In Silico Guided Drug Discovery. Journal of Medicinal Chemistry 2008, 51(22):7021–7040. 10.1021/jm8005977
- Dohkan S, Koike A, Takagi T: Improving the Performance of an SVM-Based Method for Predicting Protein-Protein Interactions. In Silico Biol 2006, 6:515–529.
- Zhu J, Zhang B, Smith EN, Drees B, Brem RB, Kruglyak L, Bumgarner RE, Schadt EE: Integrating large-scale functional genomic data to dissect the complexity of yeast regulatory networks. Nat Genet 2008, 40(7):854–861. 10.1038/ng.167
- Nielsen J, Oliver S: The next wave in metabolome analysis. Trends Biotechnol 2005, 23(11):544–546. 10.1016/j.tibtech.2005.08.005
- Rajagopalan D, Agarwal P: Inferring pathways from gene lists using a literature-derived network of biological relationships. Bioinformatics 2005, 21(6):788–793. 10.1093/bioinformatics/bti069
- Cherry JM, Adler C, Ball C, Chervitz SA, Dwight SS, Hester ET, Jia YK, Juvik G, Roe T, Schroeder M, et al.: SGD: Saccharomyces Genome Database. Nucleic Acids Research 1998, 26(1):73–79. 10.1093/nar/26.1.73
- Zhu J, Zhang MQ: SCPD: a promoter database of the yeast Saccharomyces cerevisiae. Bioinformatics 1999, 15(7–8):607–611. 10.1093/bioinformatics/15.7.607
- Wingender E, Chen X, Fricke E, Geffers R, Hehl R, Liebich I, Krull M, Matys V, Michael H, Ohnhauser R, et al.: The TRANSFAC system on gene expression regulation. Nucleic Acids Research 2001, 29(1):281–283. 10.1093/nar/29.1.281
- Mewes HW, Frishman D, Guldener U, Mannhaupt G, Mayer K, Mokrejs M, Morgenstern B, Munsterkotter M, Rudd S, Weil B: MIPS: a database for genomes and protein sequences. Nucleic Acids Research 2002, 30(1):31–34. 10.1093/nar/30.1.31
- Bairoch A, Consortium U, Bougueleret L, Altairac S, Amendolia V, Auchincloss A, Argoud-Puy G, Axelsen K, Baratin D, Blatter MC, et al.: The Universal Protein Resource (UniProt) 2009. Nucleic Acids Research 2009, 37:D169-D174. 10.1093/nar/gkn664
- Kabsch W, Sander C: Dictionary of Protein Secondary Structure - Pattern-Recognition of Hydrogen-Bonded and Geometrical Features. Biopolymers 1983, 22(12):2577–2637. 10.1002/bip.360221211
- Nelson DL, Lehninger AL, Cox MM: Lehninger principles of biochemistry. 5th edition. New York: W.H. Freeman; 2008.
- Kim WK, Ison JC: Survey of the geometric association of domain-domain interfaces. Proteins 2005, 61(4):1075–1088. 10.1002/prot.20693
- Kim WK, Henschel A, Winter C, Schroeder M: The many faces of protein-protein interactions: A compendium of interface geometry. PLoS Computational Biology 2006, 2(9):e124. 10.1371/journal.pcbi.0020124
- Lise S, Walker-Taylor A, Jones DT: Docking protein domains in contact space. BMC Bioinformatics 2006, 7:310. 10.1186/1471-2105-7-310
- Jones S, Thornton JM: Principles of protein-protein interactions. Proceedings of the National Academy of Sciences of the United States of America 1996, 93(1):13–20. 10.1073/pnas.93.1.13
- Altschul SF, Madden TL, Schaffer AA, Zhang JH, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25(17):3389–3402. 10.1093/nar/25.17.3389
- Nam JW, Shin KR, Han JJ, Lee Y, Kim VN, Zhang BT: Human microRNA prediction through a probabilistic co-learning model of sequence and structure. Nucleic Acids Res 2005, 33(11):3570–3581. 10.1093/nar/gki668
- Nguyen MN, Rajapakse JC: Two-stage support vector regression approach for predicting accessible surface areas of amino acids. Proteins 2006, 63(3):542–550. 10.1002/prot.20883
- Witten IH, Frank E: Data mining: practical machine learning tools and techniques. 2nd edition. Amsterdam; Boston, MA: Morgan Kaufmann; 2005.
- Chang CC, Lin CJ: LIBSVM: a library for support vector machines. 2001. [http://www.csie.ntu.edu.tw/~cjlin/libsvm]
- Artin E: The Gamma Function. New York: Holt, Rinehart and Winston; 1964.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.