Inferring homologous protein-protein interactions through pair position specific scoring matrix
© Lin et al.; licensee BioMed Central Ltd. 2013
Published: 21 January 2013
The protein-protein interaction (PPI) is one of the most important features to understand biological processes. For a PPI, the physical domain-domain interaction (DDI) plays the key role for biology functions. In the post-genomic era, to rapidly identify homologous PPIs for analyzing the contact residue pairs of their interfaces within DDIs on a genomic scale is essential to determine PPI networks and the PPI interface evolution across multiple species.
In this study, we proposed "pair P osition S pecific S coring M atrix (pair PSSM)" to identify homologous PPIs. The pair PSSM can successfully distinguish the true protein complexes from unreasonable protein pairs with about 90% accuracy. For the test set including 1,122 representative heterodimers and 2,708,746 non-interacting protein pairs, the mean average precision and mean false positive rate of pair PSSM were 0.42 and 0.31, respectively. Moreover, we applied pair PSSM to identify ~450,000 homologous PPIs with their interacting domains and residues in seven common organisms (e.g. Homo sapiens, Mus musculus, Saccharomyces cerevisiae and Escherichia coli).
Our pair PSSM is able to provide statistical significance of residue pairs using evolutionary profiles and a scoring system for inferring homologous PPIs. According to our best knowledge, the pair PSSM is the first method for searching homologous PPIs across multiple species using pair position specific scoring matrix and a 3D dimer as the template to map interacting domain pairs of these PPIs. We believe that pair PSSM is able to provide valuable insights for the PPI evolution and networks across multiple species.
Many biological processes involve different types of interactions among proteins. Listing the proteins in the cell is not enough to fully understand the cellular machinery and all the interactions between them need to be delineated as well . Recently, systematic identification of protein-protein interactions (PPIs) had been proposed by high throughput experimental methods (e.g. large scale yeast two-hybrid analysis and tandem affinity purification) and computational methods (e.g. phylogenetic profiles , gene expression profiles , known three-dimensional (3D) complexes , and interologs mapping ). The PPIs for diverse organisms obtained from these methods were being collected by many databases, such as DIP , BIND , MIPS  and STRING . As an increasing number of reliable PPIs become available, a new concept and method to identify homologous PPIs across multiple species has been proposed to understand a newly determined PPI in the post-genomic era .
In the interactions between proteins, the protein domains interact physically with one another to perform the necessary functions. The domain-domain interaction (DDI) can recruit the formation of multi-protein signalling complexes, and control the conformation, activity, and substrate specificity of enzymes . However, almost all large scale methods to explore interacting proteins cannot respond how a protein interacts with another one in molecular detail (which domains bind directly), whether experimental or computational methods. In addition, identification of interacting domain through 3D structural information provides interacting domains and atomic details for thousands of direct physical interactions between proteins [12, 13]. For example, the membrane fusion adaptor p47 forms a tight complex with p97 and mediates p97 binding to its t-SNARE (soluble NSF attachment protein receptor) syntaxin5 for another round of membrane fusion  (Figure S1 in Additional file 1). According to the result of PSI-BLAST, both the two homologous proteins in Saccharomyces cerevisiae (baker's yeast) are very similar (E-value < 10-100 and sequence identity > 30%) to the template in Mus musculus and might interact with each other based on the definition of homologous PPIs . However, the Rix7p lacks the important binding domain (i.e. CDC48_N domain) and not involved in the process of membrane fusion although both p97 and Rix7p belongs to type II AAA+ proteins which containing two ATPase domains. Therefore, the protein pair of Rix7p and Shp1p should be not a homologous PPI of p47 and p97 due to lack of the interacting domains.
The statistical interfacial pair potentials have been used to score how well the query protein pair fit the template structure by previous methods [12, 13]. This is a general empirical matrix for all the dimers of known structures to model the pair of query proteins . However, although binding sites are mainly hydrophobic, protruding, and electrostatic complementary, no general patterns are observed . It had been found that the free energy of binding is not evenly distributed across interfaces; instead, there are hot spots of binding energy made up of a small subset of residues in the interface of complexes . There is a correspondence between the experimental identified energy hot spots and the structurally conserved residues . Recently, many studies had been proven that conservative residues may perform specific functional (e.g. catalysis, recognition, binding) role [17, 18]. Therefore, we consider the general empirical matrix cannot characterize all binding site correctly.
To address these issues, we proposed a new method "p air P osition S pecific S coring M atrix (pair PSSM)" to estimate the probabilities with which residue pairs occur at various contact positions by evolutionary profiles, leading to a more sensitive scoring system. According to our knowledge, pair PSSM is the first method for searching homologous PPIs across multiple species using a 3D dimer as a template and automatically mapping interacting domains for these PPIs. The experimental results demonstrate that pair PSSM can successfully identify the homologous PPIs with 90% accuracy. Moreover, pair PSSM could be applied to search the homologous PPI across seven organisms commonly used in molecular research, including Homo sapiens, M. musculus, Rattus norvegicus, Caenorhabditis elegans, Drosophila melanogaster, S. cerevisiae and Escherichia coli. In these seven organisms, our method infers ~450,000 homologous PPIs in which the interacting domains and residues (binding sites) are automatically modelled. Based on these homologous PPIs, we believe that pair PSSM is able to provide valuable insights for PPI evolution and networks across multiple species.
Materials and methods
Pair position specific scoring matrix (pair PSSM)
Scoring matrix architecture
For a 3D-dimer with the number of contact residue pairs R, the empirical matrix of dimension 20 × 20 is replaced by a protein-protein interaction position-specific matrix (pair PSSM) of dimension R × 20. A typical pair PSSM with 47 (R = 47) contact residue pairs was shown in Figure S2 in Additional file 1. The residue pair in a contact position is considered as a single symbol. The advantage of this matrix is estimation of the probabilities with which residue pairs occur at various contact positions, leading to a more sensitive scoring system.
Construction of evolutionary profile
To obtain the evolutionary profiles from multiple sequence alignment, our alignment result should be come from a comprehensive and non-redundant protein database. The protein database is obtained from the NCBI Reference Sequences database (RefSeq). Although RefSeq aims to provide a non-redundant set of sequences for users, two major sources [i.e. alternative splicing and duplication of genes (paralog)] of redundancy occur in RefSeq. Therefore, we used BlastCluster to remove redundancy with 90% identical over an area covering 90% of the length of each sequence in the same species. Finally, 2,109,945 protein sequences were selected into our non-redundant protein database.
To produce a multiple sequence alignment from the PSI-BLAST, we collected all RefSeq sequence segments aligned to the two proteins of 3D-dimer with E-value ≤ 10-9. The two proteins of 3D-dimer are used as a template for constructing two multiple sequence alignments, respectively. Any sequence that is > 95% identical to 3D-dimer (template) is purged.
Target frequency estimation
Given two multiple sequence alignment derived from a 3D-dimer, we generated score matrices with the theoretical foundation for the score of a specific contact position using the form log (Q ij /P ij ), where Q ij is the estimated probability for the contact residue pair i&j to be found in the column; P ij is the expect probability of i&j to be found in the column. The estimate of Q ij for a specific contact position should converse simply to the observed frequency of the residue pair i&j in that column. To estimate the Q ij , we implemented the data-dependent the pseudocount method  which is relative simplicity and often performs nearly as well comparing the Dirichlet mixtures . The Q ij is given as , where f ij is observed frequency and is the residue pair pseudocount frequency. The S ij is the interacting energy of the residues i&j in empirical matrix. P i and P j are the background probabilities of residues i and j, respectively. The α and β are the relative weights given to observed and pseudocount residue frequency. In our study, α is set to the number of different residue-pair types in column -1 and β = 5. The P ij is defined as P ij = P i × P j . The residue composition of the protein interface is obtained from Lu et al.  (Table S1 in Additional file 1).
Amino acid classification
3D-dimers and related 3D-dimers
Here, we select the co-crystallized proteins from the PDB  with using the following criteria. (1) The resolution of the co-crystallized protein should be smaller than 3.0Å; (2) each chain of the co-crystallized proteins should be comprised more than 35 amino acids. If the protein consists of the cross-chain domain defined by SCOP, we regard it; (3) the number of interacting residue pairs is set to be greater than 25 and each chain must contain more than 5 contact residues to make sure that the dimer is reasonably extensive . Interacting residue pairs are defined as a pair of residues from different chains that have at least one pair of heavy atoms within 4.5Å with each other; (4) elimination of artificial packing complexes rather than biologically functional multimers by using PQS server ; (5) we remove redundancy by the sequence identity > 50%. Finally, the non-redundant template set, termed NRT, consists of 1,122 heterodimers and 3,514 homodimers.
To model PPIs, we explored whether the two similar dimers possess the similar protein interaction type. Here, we defined that two 3D-dimers contain the same interacting SCOP domain (more than three contact residues within the domain boundary) are "related-dimer pairs". Among 29,369 dimers derived from PDB, we first remove the dimers without annotations of SCOP domain then selected 5,553 heterodimers and 15,026 homodimers. Second, the dimers are clustered by BlastCluster  according 80% identical over an area covering 80% of the length of each sequence. We choose one representative dimer from each cluster if the number of interacting residue pairs more than the mean of the cluster and the resolution of crystallization is smallest. Third, the representative dimers are grouped based on the domain definition in SCOP. We group the dimers which possess the same interacting domain pair in family level. Totally, there are 189 groups of heterodimer and 489 groups of homodimer with ≥ 1 member. We choose one representative member for each group and pair the representative one for the all other members in the group. These pairs of dimers are considered as related dimer pairs. In this way, we derived a related 3D-dimer set, termed RD, with 459 and 1,412 related dimer pairs from the 189 groups of heterodimer and 489 groups of homodimer, respectively.
True protein complexes and unreasonable protein pairs
Our method utilized a 3D-dimer to model all candidates of homologous PPIs and used a specific pair PSSM to determine whether two proteins interact. Here, we constructed two data sets, including the protein pairs form complexes (i.e. PPIs) and the protein pairs cannot interact with each other (the unreasonable protein pairs). From 189 representative 3D-dimers (excluding the antibody-antigen complex), we used the PSI-BLAST to remove the similar dimers with > 80% sequence identity and get protein pair with E-value ≤ 0.1. If the protein pair is co-crystallized in PDB and it contains the same interacting SCOP domains for the query dimer, we consider the protein pair as the positive (i.e. true protein complex) as the homologous PPI of the 3D-dimer. In the other word, if the protein pair is not co-crystallized in PDB and it does not contain the same interacting SCOP domain for the query dimer, we considered the protein pair as negative cases (i.e. unreasonable protein pairs). Among 189 representative 3D-dimers, we selected 224 pairs of positive homologous PPIs and 282 pairs of negative protein pairs.
Gold standard positives and negatives of PPIs in S. cerevisiae
To examine the pair PSSM for identifying homologous PPIs in S. cerevisiae, the S. cerevisiae proteome was obtained from the Saccharomyces Genome Database (SGD) . The corresponding amino acid sequences of total 5,877 open reading frames (ORFs) were collected. 5,882 reliable PPs, considered as gold positive set, were downloaded from the core set of Database of Interacting Protein (DIP) . For gold negative set, we followed the previous study  assumed that the proteins in different compartments do not interact with each other and generated 2,708,746 non-interacting protein pairs according to the proteins in separate subcellular compartments .
Yeast gene expression
The gene expression profiles of two interacting proteins were also used to access the accuracy of our method according to the basic assumption: "the gene pair with similar expression profiles is likely to encode an interacting protein pair" . The Rosetta compendium set consisting of the expression profiles of 300 deletion mutants and under chemical treatments  was used to measure the similarity of gene expression profiles of two genes.
To assess the quality of our method, the mean average precision (MAP) and mean false positive rate (MFP) are used in this study. The MAP is defined as , where is the number of protein pair candidates in a hit list containing i positive PPIs; A is total numbers of positive PPIs in the template T; M is the total numbers of template. The MFP is defined as where is the total numbers of protein pair candidates in template T from PSI-BLAST.
In addition, the similarity of two gene-expression data is defined by the Pearson correlation coefficient between the two gene-expression profiles. To test whether the mean of correlation coefficient for candidates of protein-protein interactions higher than that of non-interacting protein pairs, we calculate the T-score and the P-value for the null hypothesis of the sample mean (our prediction) smaller than the mean of gold negative set.
Results and discussion
We first explore the relationship between sequence similarity and interacting similarity for homologous PPIs. Modelling PPIs by homology is reasonable only when the correlation is high enough. Second, the pair PSSM is verified in two data sets: (1) we examined whether the energy calculated from pair PSSM could distinguish the true protein complexes and unreasonable protein pairs; (2) for identified homologous PPIs in yeast proteome through pair PSSM, we used two common metrics (i.e. MAP and MFP) to assess the performance and compared with the empirical matrix used by previous method. Third, the similarity of gene expression profiles for the candidates of protein-protein interactions is examined. Finally, we applied pair PSSM to identify homologous PPIs for seven organisms and used two biological examples to illustrate the operation and power of pair PSSM.
Similar 3D-dimers imply similar interacting types
The RD set (459 pairs of related heterodimers and 1412 pairs of related homodimers) described above provides all instances of a particular interaction type occurring within different complex structures, that we then wish to compare to each other and correlate with sequence similarity. To compare the binding of different instances of the two dimers with the same interacting domains, we used pair coverage (PC) to calculate binding site overlap from the number of shared interacting residue pairs. Given a pair of related dimers A-B and A'-B', where A-A' and B-B' contain same SCOP domains. we use a structural alignment tool, CE , to align the A-A' and B-B', respectively. The PC is defined as , where the NCPM is the matching number of contact residue pairs between the structural alignment of A-A' and B-B'; NCP AB and NCP A'B' are the number contact residues pairs of dimer A-B and A'-B', respectively.
The interacting types of two dimers are very alike when the PC of a related dimer pair is greater than 0.4. In following discussion, the sequence identity between two dimers is defined as the minimum sequence identity in A&A' and B&B'. The rationale is that the interacting partners with the lower sequence identity would tend to be the better indicator for the diversity of the interaction.
In the pairs of related homodimers, the trend which gives a guide to the degree of sequence similarity needed to be confident in a similar interaction is also observed. However, there are more exceptions (~30%) in the pairs of homodimers than heterodimers (Figure S4 in Additional file 1). In addition, the PC s of heterodimers are higher than that of homodimers in difference sequence identity (Figure S5 in Additional file 1). The result means that the specificity of interaction in heterodimer is more conserved than homodimers. For this reason, we think that heterodimers are more suitable for the template to model PPIs than homodimers.
In summary, we find the related dimers indeed keep similar interacting type. Sequence similarity needed to be confident in a similar PPI. We suggest one must be careful with identity below than 30% to model PPIs by homology. Because the specificity of PPI in heterodimer is more conserved than homodimers, we consider the heterodimers are more suitable for the templates to model interactions. Finally, we find that the interacting types of antigen-antibody complexes are often diverse, they may be not suitable for used to as templates.
Verification in true protein complexes and unreasonable protein pairs
To verify pair PSSM, we would likely to study whether the specific interfacial energy calculated from pair PSSM could distinguish the true protein complexes (positive set) and unreasonable protein pairs (negative set). In 224 pairs of positive homologous PPI (i.e. true protein complexes) and 282 pairs of negative protein pair (i.e. unreasonable protein pairs), we used one dimer to score the other one by our specific empirical matrix (pair PSSM) and general empirical matrix. Based on the frequency of positive set and negative set occurred in different specific interfacial energy intervals, we found that the interfacial energy of positive set is significant higher than that of negative set (Figure S6A in Additional file 1). Next, the error rate is calculated by averaging the number of false positive divided by number of positive set and the number of false negative divided by number of negative set. Experimental result shows that the minimum error rate is 18% when the specific interfacial energy (pair PSSM) is set to 50 (Figure S6B in Additional file 1).
In addition, we also applied the general interfacial energy based on empirical matrix to the positive set and negative set (Figure S7 in Additional file 1). When the general interfacial energy is set to 10, we can obtain the minimum error rate 17%. Interestingly, the distributions of positive and negative dataset are not high concentrated in the two sides (positives in high energy and negatives in low energy) when using the specific interfacial energy (Figure S6A in Additional file 1). That is why the error rate higher in using specific interfacial energy than in using general interfacial energy. In addition, the correlation (0.9321) between the specific interfacial energy and the number of contact residues (Figure S8 in Additional file 1) is higher than the correlation (0.6753) between general interfacial energy and the number of contact residues (Figure S9 in Additional file 1).
Because specific interfacial energy is highly dependent on the characteristic of dimer template, we design a method to normalize the specific interfacial energy. When a protein pair is modelled by a 3D-dimer and gets a specific interfacial energy scored by pair PSSM, we normalize the energy defined as the specific interfacial energy of the protein pair divided by the specific interfacial energy of the dimer template. By using the normalized interfacial energy, we can find the distribution of positive and negative dataset is much more concentrated in the two sides (Figure S10A in Additional file 1) than the unnormalized (Figure S6A in Additional file 1) and the error rate reduce from 18% to 13% (Figure S10B in Additional file 1). For this reason, we consider the normalized specific interfacial energy equal set to 0.4 is a good threshold for identifying homologous PPIs.
Verification in yeast proteome
Here, we identified homologous PPIs in S. cerevisiae and used the average precisions and false positive rate, commonly used to evaluate the quality of database searching, to verify our pair PSSM. In the NRT set, 1,122 representative heterodimers were considered as queries to search database of yeast proteome by PSI-BLAST. We defined the proteins searched out with E-value ≤ 10-3 are homologous to the query protein. Given a query of heterodimer A-B, A' and B' are the homologous proteins of A and B, respectively. All the homologous protein pairs A'-B' are considered as candidates of homologous PPIs. Among these candidates, the known interacting protein pairs and the others are considered as positives and negatives, respectively.
In the above results, the candidates which are not included in the known interacting protein pairs are considered as negatives. However, it may be somewhat unreasonable because many candidates are indeed interacting proteins in nature but have not proven by experimental methods in the past. Here, we considered the candidates overlapping with 2,708,746 non-interacting protein pairs defined by Jasen et al. as negatives . The candidates without any annotations are removed for calculate average precisions and false positive rates. In this way, our method using pair PSSM is about 10% improvement than using general empirical matrix (Figure S11 in Additional file 1). In the future, we will compare our pairPSSM with previous works [4, 12, 13] carefully to judge the utilities and advantages for predicting PPIs.
Example analysis: PDB code 1a2kAD
The multiple sequence alignment results between the 14 candidates and their corresponding template, such as A chain of 1a2k (Figure 5C) and D chain of 1a2k (Figure 5D). The interface involves primarily the putative switch II loop of ran (residue 65 to 78) (Figure 5B, orange box) and the hydrophobic cavity surrounding surface of NTF2 . The interaction made by the switch II loops accounts for the ability of NTF2 to discriminate between GDP and GTP bounds forms of Ran. A striking feature of the interacting interface was the aromatic ring of Phe72 of Ran (Figure 5B, orange residue and Figure 5D, orange star site). It inserts into the hydrophobic cavity of NTF2 where it was surrounded by the hydrophobic side chains of Trp41, Leu59, Phe61, Ile64, Leu89, Ala91, Met97, Phe119 and Leu121. The positive protein pairs (i.e. NTF2&GSP1 and NTF2&GSP2) are both conservative in this important site (Figure 5). On the other hand, the interacting interface on NTF2 involved this molecule's characteristic hydrophobic cavity. Hydrophobic residues in the upper portion of the NTF2 cavity, together with negatively charged residues, Glu42, Asp92 and Asp94, are surrounding the cavity (Figure 5A, yellow residues and Figure 5C, yellow boxes) made significant contributions to the interface with GDP-Ran. The three important negatives residues are conservative from A chain of 1a2k (rat NTF2) to the yeast NTF2. However, the three important sites are mutated to Threonine in BRE5 (Figure 5A). The BRE5 is an ubiquitin protease cofactor which forms deubiquitination complex with ubp3p that coregulates anterograde and retrograde transport between the endoplasmic reticulum and Golgi compartments. The three important residues mutated may be resulted in BRE5 does not interact with GSP1 and GSP2. Encouragingly, we give poor score to these two candidates (0.08) and successfully identify the true homologous PPIs (i.e. GSP1&NTF2 and GSP2&NTF2).
Verification in yeast expression profiles
Recently, many scientists consider that genes with similar expression profiles are likely to encode interacting proteins . Therefore, we compare the distribution of gene expression profiles for the two gold standard sets and our identified homologous PPIs by pair PSSM with normalized specific interfacial energies ≥ 0.4 and 0.6 (Figure S12 in Additional file 1). The homologous PPIs composed of the same protein are not used to calculate the gene expression profiles because their expression profiles must be identical and should not be taken account of. Experimental result shows that the distribution of the correlation coefficients of our identified homologous PPIs is similar to the core set of DIP (positives) and right shift to non-interacting protein pairs (negatives) (Figure S12 in Additional file 1). Then, we used standard two sample T-test to test the mean of correlation coefficient for our identified homologous PPIs to non-interacting protein pairs. The E-values of the two sets are 10-30 and 10-26, respectively. These results indicate that the mean of correlation coefficient for identified homologous PPIs by pair PSSM is significant higher than that for non-interacting protein pairs.
Identification of homologous protein-protein interactions in seven common organisms
In the above section, we have verified our pair PSSM in two data sets and obtained a reasonable threshold for normalized specific interfacial energy about 0.4~0.5. Here, we apply pair PSSM to identify homologous PPIs in seven organisms commonly used in molecular research, including H. sapiens, M. musculus, R. norvegicus, C. elegans, D. melanogaster, S. cerevisiae and E. coli. By set the threshold to 0.5, we obtain about 450,000 homologous PPIs from the seven common organisms (Table S5 in Additional file 1). Comparing our identified homologous PPIs and the PPIs deposited in DIP database, there is a large difference for number of interactions in the same organism. For example, we identify 1,850 PPIs in S. cerevisiae but DIP collects 25,165 PPIs. On the other hand, we identify 223,151 PPIs in Homo sapiens but DIP collects only 12,975 PPIs. There are two reason for large drop, one is the large-scale experimental method (such as yeast two-hybrid analysis or proteomics-immunoprecipitation) is hard to apply in mammalian organisms and results in the interactions deposited in DIP are few in H. sapiens, M. musculus, or R. norvegicus. The other reason is gene duplication and alternative splicing often occurred in the mammalian organisms and result in some redundancy protein in protein database. In these organisms, we may over estimate the number of identified PPIs.
Structural genomics projects are generating new structures at an unprecedented rate--a benefit of recent developments in high-throughput technologies . As a result, the number of protein structures in the Protein Data Bank (PDB) is increasing rapidly. For each new determined 3D-dimer, we can apply our method to identify all the candidates in thousands of organisms quickly. It helps the biologists to further detail analysis the PPI network.
Model homologous protein-protein interactions in H. sapiens by 1evtBD
The pair PSSM can apply not only to yeast proteome but also to any other organisms. PPIs between the fibroblast growth factors (FGFs) and their receptors had been intensive studied [41, 42]. FGFs play key roles in morphogenesis, development, angiogenesis, and wound healing. These FGF-simulated processes are mediated by for FGF receptor tyrosine kinase. There are more than 20 human protein FGFs that bind to one or more of 7 FGF receptors (FGFR1b, -1c, -2b, -2c, -3b, -3c, -4), where the c and b denote isoforms IIIc & IIIb formed by alternative splicing. The complex of FGF1/FGFR1 (Figure S13 in Additional file 1) had been dissolved by Plotnikov et al. (PDBID: 1evt) . Ornitz et al. perform a study of FGFR specificity by measuring mitogenic activity of FGFR-inducible BaF3 cell-line . Based on the binding affinity of the seven FGF/receptor complexes (from FGF4 to 7 receptors, FGFR1b, -1c, -2b, -2c, -3b, -3c, -4)  (Table S6 in Additional file 1), we determined that low and high binding affinity relative to the FGF1 are < 10% and > 10%, respectively.
In this study, we used the pair PSSM of 1evtBD to model homologous PPIs for the seven FGF/receptor complexes. Among seven FGF/receptor complexes, six complexes are high affinity and our pairPSSM also give high interfacial energy. However, the other one, FGF4/FGFR3b complexes, with very low binding affinity (1.0%) but our pairPSSM give a high normalized interfacial energy (0.84). For the detailed sequence analysis (Figure S13 in Additional file 1), we find that most contact positions in FGFR3b are much conservative except some residues in D3 immunoglobulin (Ig)-like domains (Figure S13 in Additional file 1, orange box). This result may mean some other factors involved in determining the strength of the FGFR interactions. In conclusion, we successfully identify 6 out of 7 FGF/receptor complexes. There is a good agreement between the specific interfacial energy and binding affinity even though still with an incorrect case.
We have developed a new method "pair PSSM", a more sensitive scoring system for estimating the probabilities with residue pairs occurred at various contact positions by evolutionary profiles, to infer domain annotated homologous PPIs across multiple species. The specific interfacial energy calculated from pair PSSM can successfully distinguish the true protein complexes and non-reasonable protein pair with about 90% accuracy. Experimental results show that the pair PSSM outperforms general empirical matrix about 10% improvements even though for the distantly related protein sequences. Moreover, we applied pair PSSM to identify ~450,000 homologous PPIs, automatically modelled the interacting domains and residues (binding sites), in seven organisms, including H. sapiens, M. musculus, R. norvegicus, C. elegans, D. melanogaster, S. cerevisiae and E. coli. Based on these homologous PPIs, we believe that pair PSSM is able to provide valuable insights for PPI evolution and networks across multiple species.
The publication costs for this article were funded by National Science Council, partial supports of Ministry of Education and National Health Research Institutes [NHRI-EX100-10009PI].
This article has been published as part of BMC Bioinformatics Volume 14 Supplement 2, 2013: Selected articles from the Eleventh Asia Pacific Bioinformatics Conference (APBC 2013): Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/14/S2.
This paper was supported by National Science Council, partial supports of Ministry of Education and National Health Research Institutes [NHRI-EX100-10009PI]; 'Center for Bioinformatics Research of Aiming for the Top University Program' of the National Chiao Tung University and Ministry of Education, Taiwan. We also thanks Core Facility for Protein Structural Analysis supported by National Core Facility.
- von Mering C, Krause R, Snel B, Cornell M, Oliver SG, Fields S, Bork P: Comparative assessment of large-scale data sets of protein-protein interactions. Nature. 2002, 417: 399-403.View ArticlePubMedGoogle Scholar
- Pellegrini M, Marcotte EM, Thompson MJ, Eisenberg D, Yeates TO: Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc Natl Acad Sci U S A. 1999, 96 (8): 4285-4288. 10.1073/pnas.96.8.4285.PubMed CentralView ArticlePubMedGoogle Scholar
- Jansen R, Yu H, Greenbaum D, Kluger Y, Krogan NJ, Chung S, Emili A, Snyder M, Greenblatt JF, Gerstein M: A Bayesian networks approach for predicting protein-protein interactions from genomic data. Science. 2003, 302 (5644): 449-453. 10.1126/science.1087361.View ArticlePubMedGoogle Scholar
- Chen YC, Lo YS, Hsu WC, Yang JM: 3D-partner: a web server to infer interacting partners and binding models. Nucleic acids research. 2007, 35 (Web Server): W561-567. 10.1093/nar/gkm346.PubMed CentralView ArticlePubMedGoogle Scholar
- Li S, Armstrong CM, Bertin N, Ge H, Milstein S, Boxem M, Vidalain PO, Han JD, Chesneau A, Hao T: A map of the interactome network of the metazoan C. elegans. Science. 2004, 303 (5657): 540-543. 10.1126/science.1091403.PubMed CentralView ArticlePubMedGoogle Scholar
- Salwinski L, Miller CS, Smith AJ, Pettit FK, Bowie JU, Eisenberg D: The Database of Interacting Proteins: 2004 update. Nucleic acids research. 2004, 32 (Database): D449-D451.PubMed CentralView ArticlePubMedGoogle Scholar
- Alfarano C, Andrade CE, Anthony K, Bahroos N, Bajec M, Bantoft K, Betel D, Bobechko B, Boutilier K, Burgess E: The biomolecular interaction network database and related tools 2005 update. Nucleic acids research. 2005, 33 (Database): D418-D424.PubMed CentralPubMedGoogle Scholar
- Mewes HW, Amid C, Arnold R, Frishman D, Guldener U, Mannhaupt G, Munsterkotter M, Pagel P, Strack N, Stumpflen V: MIPS: analysis and annotation of proteins from whole genomes. Nucleic acids research. 2004, 32: D41-D44. 10.1093/nar/gkh092.PubMed CentralView ArticlePubMedGoogle Scholar
- von Mering C, Jensen LJ, Snel B, Hooper SD, Krupp M, Foglierini M, Jouffre N, Huynen MA, Bork P: STRING: known and predicted protein-protein associations, integrated and transferred across organisms. Nucleic acids research. 2005, 33 (Database): D433-D437.PubMed CentralPubMedGoogle Scholar
- Chen CC, Lin CY, Lo YS, Yang JM: PPISearch: a web server for searching homologous protein-protein interactions across multiple species. Nucleic acids research. 2009, 37 (Web Server): W369-375. 10.1093/nar/gkp309.PubMed CentralView ArticlePubMedGoogle Scholar
- Pawson T, Nash P: Assembly of cell regulatory systems through protein interaction domains. Science. 2003, 300 (5618): 445-452. 10.1126/science.1083653.View ArticlePubMedGoogle Scholar
- Aloy P, Russell RB: Interrogating protein interaction networks through structural biology. Proc Natl Acad Sci U S A. 2002, 99 (9): 5896-5901. 10.1073/pnas.092147999.PubMed CentralView ArticlePubMedGoogle Scholar
- Lu L, Lu H, Skolnick J: MULTIPROSPECTOR: an algorithm for the prediction of protein-protein interactions by multimeric threading. Proteins. 2002, 49 (3): 350-364. 10.1002/prot.10222.View ArticlePubMedGoogle Scholar
- Uchiyama K, Jokitalo E, Kano F, Murata M, Zhang X, Canas B, Newman R, Rabouille C, Pappin D, Freemont P: VCIP135, a novel essential factor for p97/p47-mediated membrane fusion, is required for Golgi and ER assembly in vivo. Journal of Cell Biology. 2002, 159 (5): 855-866. 10.1083/jcb.200208112.PubMed CentralView ArticlePubMedGoogle Scholar
- Keskin O, Ma B, Nussinov R: Hot regions in protein--protein interactions the organization and contribution of structurally conserved hot spot residues:. Journal of Molecular Biology. 2005, 345 (5): 1281-1294. 10.1016/j.jmb.2004.10.077.View ArticlePubMedGoogle Scholar
- Bogan AA, Thorn KS: Anatomy of hot spots in protein interfaces. Journal of Molecular Biology. 1998, 280 (1): 1-9. 10.1006/jmbi.1998.1843.View ArticlePubMedGoogle Scholar
- Ptitsyn OB: Protein folding and protein evolution: common folding nucleus in different subfamilies of c-type cytochromes?. Journal of Molecular Biology. 1998, 278 (3): 655-666. 10.1006/jmbi.1997.1620.View ArticlePubMedGoogle Scholar
- Elcock AH, McCammon JA: Identification of protein oligomerization states by analysis of interface conservation. Proc Natl Acad Sci U S A. 2001, 98 (6): 2990-2994. 10.1073/pnas.061411798.PubMed CentralView ArticlePubMedGoogle Scholar
- Tatusov RL, Altschul SF, Koonin EV: Detection of conserved segments in proteins: iterative scanning of sequence databases with alignment blocks. Proc Natl Acad Sci U S A. 1994, 91 (25): 12091-12095. 10.1073/pnas.91.25.12091.PubMed CentralView ArticlePubMedGoogle Scholar
- Henikoff JG, Henikoff S: Using substitution probabilities to improve position-specific scoring matrices. Comput Appl Biosci. 1996, 12 (2): 135-143.PubMedGoogle Scholar
- Lu H, Lu L, Skolnick J: Development of unified statistical potentials describing protein-protein interactions. Biophysical Journal. 2003, 84 (3): 1895-1901. 10.1016/S0006-3495(03)74997-2.PubMed CentralView ArticlePubMedGoogle Scholar
- Saha RP, Bahadur RP, Chakrabarti P: Interresidue contacts in proteins and protein-protein interfaces and their use in characterizing the homodimeric interface. Journal of Proteome Research. 2005, 4 (5): 1600-1609. 10.1021/pr050118k.View ArticlePubMedGoogle Scholar
- Deshpande N, Addess KJ, Bluhm WF, Merino-Ott JC, Townsend-Merino W, Zhang Q, Knezevich C, Xie L, Chen L, Feng Z: The RCSB Protein Data Bank: a redesigned query system and relational database based on the mmCIF schema. Nucleic acids research. 2005, 33 (Database): D233-D237.PubMed CentralPubMedGoogle Scholar
- Sprinzak E, Margalit H: Correlated sequence-signatures as markers of protein-protein interaction. Journal of Molecular Biology. 2001, 311 (4): 681-692. 10.1006/jmbi.2001.4920.View ArticlePubMedGoogle Scholar
- Henrick K, Thornton JM: PQS: a protein quaternary structure file server. Trends in Biochemical Sciences. 1998, 23 (9): 358-361. 10.1016/S0968-0004(98)01253-5.View ArticlePubMedGoogle Scholar
- Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic acids research. 1997, 25 (17): 3389-3402. 10.1093/nar/25.17.3389.PubMed CentralView ArticlePubMedGoogle Scholar
- Hirschman JE, Balakrishnan R, Christie KR, Costanzo MC, Dwight SS, Engel SR, Fisk DG, Hong EL, Livstone MS, Nash R: Genome Snapshot: a new resource at the Saccharomyces Genome Database (SGD) presenting an overview of the Saccharomyces cerevisiae genome. Nucleic acids research. 2006, 34 (Database): D442-D445.PubMed CentralView ArticlePubMedGoogle Scholar
- Kumar A, Agarwal S, Heyman JA, Matson S, Heidtman M, Piccirillo S, Umansky L, Drawid A, Jansen R, Liu Y: Subcellular localization of the yeast proteome. Genes and Develop. 2002, 16 (6): 707-719. 10.1101/gad.970902.View ArticleGoogle Scholar
- Ge H, Liu Z, Church GM, Vidal M: Correlation between transcriptome and interactome mapping data from Saccharomyces cerevisiae. Nature Genetics. 2001, 29 (4): 482-486. 10.1038/ng776.View ArticlePubMedGoogle Scholar
- Hughes TR, Marton MJ, Jones AR, Roberts CJ, Stoughton R, Armour CD, Bennett HA, Coffey E, Dai H, He YD: Functional discovery via a compendium of expression profiles. Cell. 2000, 102 (1): 109-126. 10.1016/S0092-8674(00)00015-5.View ArticlePubMedGoogle Scholar
- Shindyalov IN, Bourne PE: Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Engineering. 1998, 11 (9): 739-747. 10.1093/protein/11.9.739.View ArticlePubMedGoogle Scholar
- Stewart M, Kent HM, McCoy AJ: Structural basis for molecular recognition between nuclear transport factor 2 (NTF2) and the GDP-bound form of the Ras-family GTPase Ran. Journal of Molecular Biology. 1998, 277 (3): 635-646. 10.1006/jmbi.1997.1602.View ArticlePubMedGoogle Scholar
- Koepp DM, Silver PA: A GTPase controlling nuclear trafficking: running the right way or walking RANdomly?. Cell. 1996, 87 (1): 1-4. 10.1016/S0092-8674(00)81315-X.View ArticlePubMedGoogle Scholar
- Gorlich D, Pante N, Kutay U, Aebi U, Bischoff FR: Identification of different roles for RanGDP and RanGTP in nuclear protein import. The EMBO journal. 1996, 15 (20): 5584-5594.PubMed CentralPubMedGoogle Scholar
- Wong DH, Corbett AH, Kent HM, Stewart M, Silver PA: Interaction between the small GTPase Ran/Gsp1p and Ntf2p is required for nuclear transport. Molecular and Cellular Biology. 1997, 17 (7): 3755-3767.PubMed CentralView ArticlePubMedGoogle Scholar
- Clarkson WD, Corbett AH, Paschal BM, Kent HM, McCoy AJ, Gerace L, Silver PA, Stewart M: Nuclear protein import is decreased by engineered mutants of nuclear transport factor 2 (NTF2) that do not bind GDP-Ran. Journal of Molecular Biology. 1997, 272 (5): 716-730. 10.1006/jmbi.1997.1255.View ArticlePubMedGoogle Scholar
- Punta M, Coggill PC, Eberhardt RY, Mistry J, Tate J, Boursnell C, Pang N, Forslund K, Ceric G, Clements J: The Pfam protein families database. Nucleic acids research. 2012, 40 (Database): D290-301.PubMed CentralView ArticlePubMedGoogle Scholar
- Cohen M, Stutz F, Belgareh N, Haguenauer-Tsapis R, Dargemont C: Ubp3 requires a cofactor, Bre5, to specifically de-ubiquitinate the COPII protein, Sec23. Nature cell biology. 2003, 5 (7): 661-667. 10.1038/ncb1003.View ArticlePubMedGoogle Scholar
- Grigoriev A: A relationship between gene expression and protein interactions on the proteome scale: analysis of the bacteriophage T7 and the yeast Saccharomyces cerevisiae. Nucleic acids research. 2001, 29 (17): 3513-3519. 10.1093/nar/29.17.3513.PubMed CentralView ArticlePubMedGoogle Scholar
- Todd AE, Marsden RL, Thornton JM, Orengo CA: Progress of structural genomics initiatives: an analysis of solved target structures. Journal of Cell Biology. 2005, 348 (5): 1235-1260.Google Scholar
- Burgess WH, Maciag T: The heparin-binding (fibroblast) growth factor family of proteins. Annu Rev Biochem. 1989, 58: 575-606. 10.1146/annurev.bi.58.070189.003043.View ArticlePubMedGoogle Scholar
- Basilico C, Moscatelli D: The FGF family of growth factors and oncogenes. Advances in cancer research. 1992, 59: 115-165.View ArticlePubMedGoogle Scholar
- Plotnikov AN, Hubbard SR, Schlessinger J, Mohammadi M: Crystal structures of two FGF-FGFR complexes reveal the determinants of ligand-receptor specificity. Cell. 2000, 101 (4): 413-424. 10.1016/S0092-8674(00)80851-X.View ArticlePubMedGoogle Scholar
- Ornitz DM, Xu J, Colvin JS, McEwen DG, MacArthur CA, Coulier F, Gao G, Goldfarb M: Receptor specificity of the fibroblast growth factor family. The Journal of biological chemistry. 1996, 271 (25): 15292-15297. 10.1074/jbc.271.25.15292.View ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.