- Research article
- Open Access
Probabilistic prediction and ranking of human protein-protein interactions
© Scott and Barton; licensee BioMed Central Ltd. 2007
- Received: 10 May 2007
- Accepted: 05 July 2007
- Published: 05 July 2007
Although the prediction of protein-protein interactions has been extensively investigated for yeast, few such datasets exist for the far larger proteome in human. Furthermore, it has recently been estimated that the overall average false positive rate of available computational and high-throughput experimental interaction datasets is as high as 90%.
The prediction of human protein-protein interactions was investigated by combining orthogonal protein features within a probabilistic framework. The features include co-expression, orthology to known interacting proteins and the full-Bayesian combination of subcellular localization, co-occurrence of domains and post-translational modifications. A novel scoring function for local network topology was also investigated. This topology feature greatly enhanced the predictions and together with the full-Bayes combined features, made the largest contribution to the predictions. Using a conservative threshold, our most accurate predictor identifies 37606 human interactions, 32892 (80%) of which are not present in other publicly available large human interaction datasets, thus substantially increasing the coverage of the human interaction map. A subset of the 32892 novel predicted interactions have been independently validated. Comparison of the prediction dataset to other available human interaction datasets estimates the false positive rate of the new method to be below 80% which is competitive with other methods. Since the new method scores and ranks all human protein pairs, smaller subsets of higher quality can be generated thus leading to even lower false positive prediction rates.
The set of interactions predicted in this work increases the coverage of the human interaction map and will help determine the highest confidence human interactions.
- Protein Pair
- Human Protein Reference Database
- Interaction Dataset
- International Protein Index
- Transitive Module
Protein-protein interactions perform and regulate fundamental cellular processes. The comprehensive study of such interactions on a genome-wide scale will lead to a clearer understanding of diverse cellular processes and of the molecular mechanisms of disease. Although the determination of interactions by small-scale laboratory techniques is impractical for a complete proteome on the grounds of cost and time, several experimental techniques now exist to determine protein-protein interactions in a high-throughput manner. High-throughput datasets have been generated for model organisms such as yeast [2–6], worm and fly [8, 9] as well as Escherichia coli. In addition, the first broad-focus experimental datasets for the human interactome have recently been published [11, 12]. Interactions determined by high-throughput methods are generally considered to be less reliable than those obtained by low-throughput studies [13, 14] and as a consequence efforts are also underway to extract evidence for interactions from the literature [15–17]. Analysis of the high-throughput datasets has shown that they overlap very little with each other, suggesting that their coverage is low. Indeed, it has been estimated recently that the current yeast and human protein interaction maps are only 50% and 10% complete, respectively.
Predictors based on sequence and structure exploit the observation that some pairs of sequence motifs, domains and structural families tend to interact preferentially. Some methods predict interaction from sequence-motifs found to be over-represented in interacting protein pairs , or by considering the physico-chemical properties and the location of groups of amino acids in the sequence [20, 21]. Others investigate the co-occurrence in interacting proteins of specific protein domains or their structural family classification [22, 23]. When three-dimensional structures are available for both proteins thought to interact, high quality predictions and additional information such as the residues involved in the interaction and their binding affinity may be estimated (reviewed in ). Similarly, when two proteins show clear sequence similarity to proteins that exist in a complex for which the three-dimensional structure is known, detailed predictions of the atomic-level interactions may be made. For example, the major complexes in yeast have been predicted by this strategy .
Predictors based on comparative genomics have been exploited primarily in prokaryotes. They consider the physical location of genes, as well as their pattern of occurrence and evolutionary rate, to predict interactions or functional relationships between protein pairs. Some predictors make use of the observation that neighboring genes whose relative location is conserved across several prokaryotic organisms are likely to interact . Other predictors exploit the observation that gene pairs that co-occur in related species or that co-evolve also tend to be more likely to interact [27–30]. In addition, domains that exist as separate proteins in some genomes but are also seen fused in a single protein in other genomes have been used to suggest the isolated domains may interact [31, 32].
Predictors based on orthology work on the assumption that the orthologs of a protein pair known to interact in one organism will also interact. Such relationships are often referred to as interologs. For example, at BLAST e-values below 10^-10, it has been shown that 16–30% of yeast interactions can be transferred to the worm, while further studies have estimated that a joint e-value below 10^-70 is required to transfer interactions reliably between organisms. Interologs have been used to predict protein-protein interactions in human.
Predictors based on functional features exploit non-sequence information to infer interactions. Some predictors exploit the observation that there is a significant correlation in the expression levels of transcripts encoding proteins that interact . Since proteins must be co-localized in order to interact, protein subcellular localization has often been used to assess the quality of interaction datasets [38, 39]. Similarly, interacting proteins are also often involved in similar cellular processes, so Gene Ontology "process" and "function" annotations have been exploited to predict interactions and validate high-throughput datasets [16, 36, 38].
Predictors have exploited similarities in the network topology of known interaction datasets to predict novel interactions. In one study, the local topology of small-world networks has been used to assess the quality of interaction datasets and predict novel interactions, while Gerstein and colleagues have investigated the prediction of interactions by the identification of missing edges in almost fully connected complexes.
In addition to these diverse approaches, some groups have combined concepts from several of the above categories in integrative frameworks. The first such predictor integrated co-expression data, co-essentiality as well as biological function in a naïve Bayes network to provide proteome-wide de novo prediction of yeast protein interactions . Subsequently, the combination of many more diverse features was investigated using different frameworks to predict yeast protein-protein interactions, increasing the prediction accuracy and allowing an assessment of the limits of genomic integration [42–44]. The integration of diverse genomic features has also been useful in the investigation of the related but broader problem of predicting protein-protein associations as well as complex and pathway membership (see for example ).
Although many computational methods have investigated the prediction of protein-protein interactions, few have so far been applied to the human proteome. The first large-scale prediction of the human interactome map involved transferring interactions from model organisms. This resulted in over 70000 predicted physical interactions involving approximately 6200 human proteins. A second method integrated expression data, orthology, protein domain data and functional annotations into a probabilistic framework and resulted in the prediction of nearly 40000 human protein interactions. It has recently been estimated that the false-positive rates of these computational datasets as well as of available high-throughput human interaction datasets are, on average, as high as 90% and their coverage is only approximately 10%, indicating that more such efforts are needed to increase the coverage and confidence we have in current maps of the human interactome.
In this paper, the prediction of physical interactions between human proteins has been investigated by integrating in a Bayesian framework several different pieces of evidence including orthology, functional features and local network topology. In order to increase the accuracy and coverage of the predictions, different types of negative data (non-interacting protein pairs) were explored to train the predictor. The most accurate of the predictors was then used to assess the likelihood of pair-wise interaction for over 20000 human proteins from the IPI (International Protein Index) database. These predictions provide a likelihood of interaction for over 260 million human protein pairs and lead to the prediction of over 37000 human interactions. They should thus augment current knowledge of the human interactome as well as the understanding of the relationship between distinct cellular processes.
Architecture of the predictor and training of the modules
Features considered in the prediction of interactions for each module
- Expression: GDS596 from the Gene Expression Omnibus; gene expression profiles from 79 physiologically normal tissues obtained from various sources; Pearson correlation of co-expression over all conditions; 20 bins of equal size covering the correlation value range (-1 to +1)
- Orthology: InParanoid, BIND, DIP and GRID databases; interactions of homologous protein pairs from yeast, fly, worm and human; organism-based using InParanoid score
- Combined (localization): PSLT predictions; PSLT is a human subcellular localization predictor that considers nine different compartments (ER, Golgi, cytosol, nucleus, peroxisome, plasma membrane, lysosome, mitochondria and extracellular); qualitative score: proximity of compartments; 4 bins (same, neighboring, different compartments, or not localized)
- Combined (domain co-occurrence): InterPro and Pfam; protein domains and motifs; 5 bins covering range of Chi-square scores
- Combined (PTM co-occurrence): HPRD and UniProt; 4 bins covering range of PTM scores
- Disorder: VLS2 predictions; prediction of protein intrinsic disorder; sum of the percent disorder for each protein in a pair; 6 bins covering range of scoring function (0 to 200%)
- Transitive: module that considers local topology of underlying network predicted using combinations of above features; 5 bins covering range of scoring function
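The Expression feature can be illustrated with a short Python sketch: compute the Pearson correlation of two expression profiles and assign it to one of 20 equal-width bins over the range -1 to +1. The function names and the handling of the upper boundary (a correlation of exactly +1 is placed in the top bin) are assumptions made for illustration, not details taken from the paper.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles.

    Assumes both profiles vary (non-zero variance); otherwise the
    denominator is zero and a ZeroDivisionError is raised.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_bin(r, n_bins=20):
    """Index (0 .. n_bins-1) of the equal-width bin over [-1, +1] containing r."""
    idx = int((r + 1.0) / 2.0 * n_bins)
    # Clamp so that r = +1 (or tiny floating-point overshoot) lands in the top bin.
    return min(max(idx, 0), n_bins - 1)


# Toy profiles (hypothetical values, three "tissues" only).
r_example = pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
bin_example = correlation_bin(r_example)
```

Each bin then acts as one state of the Expression module, for which a likelihood ratio is estimated from the training data.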
The likelihood ratios of interaction are evaluated for each module by considering the relative proportions of positive and negative training examples that have a specific state (i.e. that fall in a particular bin of a module). The datasets used to train the predictor consisted of 26896 known human protein interactions extracted from the Human Protein Reference Database (HPRD)  and approximately 100 times more randomly chosen protein pairs used as negative examples. The composition of the datasets and likelihood ratio calculations are explained in greater detail in the Methods section. Once the final likelihood ratio of interaction (LRfinal) is calculated for a given protein pair as shown in Figure 1B, it is possible to estimate the posterior odds ratio of interaction by multiplying the final likelihood ratio by the prior odds ratio of interaction. Protein pairs that have a posterior odds of interaction above 1 are more likely to interact than not to interact, thus providing an obvious threshold to predict interacting proteins. Estimates for the prior odds ratio of interaction vary. Previous interaction studies on yeast and human use prior odds ratios that range from 1/600 to > 1/400 [37, 43, 46, 47]. The evaluation of this ratio is difficult because not all true interactions are known. As detailed in Methods, the prior odds ratio for human protein interaction was explored by considering different versions and subsets of human interaction datasets. This suggested that there is insufficient data currently available to determine a reliable ratio for human. Accordingly, we selected a prior odds ratio of interaction of 1/400 which is similar to current estimates for yeast and is probably still quite conservative. Thus, the likelihood ratio threshold to predict interactions is 400.
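As a minimal sketch of the calculation described above, the following Python snippet estimates a likelihood ratio for each bin from toy training labels and converts a final likelihood ratio into posterior odds. The function names, the toy state labels and the decision to leave states with an empty class undefined are illustrative assumptions; only the prior odds ratio of 1/400 is taken from the text.

```python
from collections import Counter

PRIOR_ODDS = 1 / 400  # prior odds ratio of interaction selected in the text


def bin_likelihood_ratios(pos_states, neg_states):
    """LR(state) = P(state | interacting) / P(state | non-interacting)."""
    pos_counts, neg_counts = Counter(pos_states), Counter(neg_states)
    n_pos, n_neg = len(pos_states), len(neg_states)
    ratios = {}
    for state in set(pos_counts) | set(neg_counts):
        if pos_counts[state] == 0 or neg_counts[state] == 0:
            continue  # assumption: no estimate when one class never shows the state
        ratios[state] = (pos_counts[state] / n_pos) / (neg_counts[state] / n_neg)
    return ratios


def posterior_odds(lr_final, prior_odds=PRIOR_ODDS):
    """Posterior odds of interaction = final likelihood ratio x prior odds."""
    return lr_final * prior_odds


# Toy training data: 100 positives and 10000 negatives (ratio 1:100).
toy_pos = ["high"] * 30 + ["low"] * 70
toy_neg = ["high"] * 100 + ["low"] * 9900
toy_ratios = bin_likelihood_ratios(toy_pos, toy_neg)  # "high" is about 30
at_threshold = posterior_odds(400)                    # about 1
```

With a prior of 1/400, a pair reaches posterior odds of 1 exactly when its final likelihood ratio reaches 400, which is why 400 serves as the prediction threshold.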
Likelihood ratios of the modules
Figure 1 summarizes the likelihood ratios computed for the five modules. The different modules differ in the range of likelihood ratio values achieved by their different states. The Orthology and Combined modules both have states that achieve likelihood ratios above 400 (as high as 1207 for the Orthology module and 613 for the Combined module), indicating that both these modules can, on their own, predict some interacting protein pairs with a posterior odds ratio above 1.
The Expression module follows trends seen in previous studies with increasing likelihood ratios of interaction reflecting increasing expression correlation [37, 46]. However, since the highest likelihood ratio for the expression datasets that we consider is 33, they are not sufficient on their own to predict interacting protein pairs with a posterior odds ratio above 1. Similarly, but in a much more pronounced way, the Disorder module is only slightly predictive of interaction, with a maximum likelihood ratio of 1.8.
Most states of the Orthology module achieve higher likelihood ratios than the highest obtained by the Expression and Disorder modules. This is not surprising as the transfer of interacting orthologs (known as interologs) from one organism to another is a popular method to predict interactions (see for example [34, 48]), particularly in the case of organisms like human for which only a small proportion of interactions are known. The direct transfer of interactions to human from either yeast, fly or worm does not alone result in a posterior odds ratio above 1 (as the likelihood ratios of interaction for all yeast, fly and worm bins in the Orthology module are below 400). This is consistent with previous studies indicating that quite stringent joint E-values must be used to transfer interactions safely between organisms [34, 35]. In contrast, the consideration of human interactions paralogous to the human protein pairs under investigation results in likelihood ratios of 431 and 1034 (depending on how close the paralogs are as described in Methods), which are much higher than those obtained for any single model organism. This agrees with a recent report that suggested protein-protein interactions are more conserved within species than across species.
Previous methods have investigated the use of co-occurring domains to predict interaction (see for example [23, 46]). Many pairs of domains co-occur in proteins known to interact. When investigated as a separate feature, the chi-square score of co-occurrence of domain pairs correlates well with the likelihood of interaction of protein pairs that contain these domains, with the highest chi-square score bin obtaining a likelihood ratio of 14, as shown in Figure 3A. Similarly, the co-occurrence of PTMs is also predictive of interaction, with its highest scoring bin obtaining a likelihood ratio of 6 as shown in Figure 3B. Lists of high scoring domain pairs and PTM pairs are shown in Additional Files 1 and 2.
Subcellular localization has been extensively used both to assess the quality of interaction datasets [11, 50, 51] and to generate examples of non-interacting protein pairs to use as negative datasets when training and testing predictors [37, 46]. In the present study, the use of localization was investigated as a feature predictive of interaction. Four possible localization states were considered for protein pairs: same compartment, neighboring compartments, different non-neighboring compartments and absence of localization annotation (more details are given in the Methods section). As shown in Figure 3C, the likelihood ratio of same compartment protein pairs was found to be twice as high as that of randomly chosen or non-annotated protein pairs, whereas different non-neighboring protein pairs are more than three times less likely to interact than random protein pairs. Individual localization features achieve low interaction likelihood ratios. However, when integrated into the Combined module, domain, PTM and localization information together achieve likelihood ratios that are high enough to predict interaction on their own (i.e. above 400). As expected, the highest likelihood ratio bins for the Combined module are those representing the highest combinations of the three features separately.
The transitive module enhances the preliminary likelihood score (PS) (calculated using the group A modules) by considering the local topology of the resulting network which is assessed using the neighborhood topology score as detailed in the Methods section. The likelihood ratios for different values of the neighborhood topology score are shown in Figure 1B. The Transitive module is highly predictive of interaction and achieves likelihood ratios as high as 229. This module cannot be used alone as it requires as input the output of at least one group A module. However, it can predict interacting protein pairs with a posterior odds ratio above 1.0 when used in combination with any single module in group A (as the product of the highest likelihood ratios of the transitive module and any group A module is greater than 400 as can be seen from Figure 1).
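The claim that the Transitive module can cross the threshold with any single group A module can be checked with the maximum likelihood ratios quoted in the text. The sketch below assumes that group A consists of the Expression, Orthology, Combined and Disorder modules; the variable names are illustrative.

```python
THRESHOLD = 400       # likelihood ratio needed for posterior odds of 1
TRANSITIVE_MAX = 229  # highest likelihood ratio of the Transitive module

# Highest likelihood ratio reported for each module (assumed to form group A).
GROUP_A_MAX = {
    "Expression": 33,
    "Orthology": 1207,
    "Combined": 613,
    "Disorder": 1.8,
}

# Best-case product of the Transitive module with each single group A module.
best_products = {name: lr * TRANSITIVE_MAX for name, lr in GROUP_A_MAX.items()}
all_exceed = all(value > THRESHOLD for value in best_products.values())
```

Even the weakest pairing, Disorder with Transitive, gives roughly 1.8 x 229 = 412, just above 400, while Expression with Transitive gives 33 x 229 = 7557.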
Independence of the modules
Pairwise Pearson correlation for all modules
Accuracy of the predictors
All combinations of modules were examined to determine which of the resulting predictors achieved the highest prediction accuracy. In order to analyze the predictions, five-fold cross validation experiments were performed and the area under partial ROC (receiver operating characteristic) curves (partial AUCs) measured. ROC50 and ROC100 curves were selected as they consider a large enough number of positives to include all protein pairs predicted to have a posterior odds ratio above 1.0 by all the predictors investigated. Protein pairs predicted to have a posterior odds ratio below 1.0 have an estimated true positive rate below 50% and thus are more likely not to interact than to interact. These protein pairs are therefore not of interest in this context. The area under all ROCn curves considered is relatively low because of the high proportion of negatives with respect to positives in the training and test sets (100:1).
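A partial AUC of this kind can be sketched as follows. The snippet uses one common definition of ROCn (the area under the ROC curve up to the first n false positives, normalized by n times the number of positives); whether the paper uses exactly this normalization is an assumption, as are the function name and the toy data.

```python
def roc_n(scores, labels, n):
    """Normalized area under the ROC curve up to the first n false positives.

    scores: predicted scores (higher = more likely to interact)
    labels: 1 for a known interaction, 0 for a negative example
    Assumes the data contain at least n negatives; ties are kept in input order.
    """
    total_pos = sum(labels)
    if total_pos == 0 or n <= 0:
        raise ValueError("need at least one positive and n > 0")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_pos = false_pos = 0
    area = 0
    for _, label in ranked:
        if label:
            true_pos += 1
        else:
            false_pos += 1
            area += true_pos  # positives ranked above this false positive
            if false_pos == n:
                break
    return area / (n * total_pos)


# Toy ranking: positive, negative, positive, negative.
toy_scores = [0.9, 0.8, 0.7, 0.6]
toy_labels = [1, 0, 1, 0]
toy_roc2 = roc_n(toy_scores, toy_labels, n=2)
```

For ROC50 and ROC100, n would be 50 and 100 respectively, so only the top of the ranked list contributes to the score.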
Prediction accuracy of different combinations of modules (table rows: modules included in prediction; coverage of the Informative Protein Set (%); measures of accuracy; estimation of number of interactions predicted at posterior odds ratio > 4, > 2 and > 1)
As the scores of the predictors increase, so does the number of interactions predicted above different posterior odds ratio thresholds (see lower portion of Table 3). For example, the Expression-Orthology predictor achieves a ROC50 AUC of 0.024 and predicts 5670 interactions at a posterior odds ratio greater than 1 whereas the Expression-Orthology-Combined predictor achieves a ROC50 AUC of 0.044 and predicts over 15000 interactions at a posterior odds ratio above 1. The best combination of Group A modules is the predictor consisting of the Expression, Orthology and Combined modules.
The Transitive module, which can only be used in combination with other modules, increases substantially the scores and number of interactions predicted. The right-hand portion of Table 3 shows the accuracy measures for the highest scoring subset of predictors that consider the Transitive module. The Transitive module enhances the prediction by identifying among protein pairs with a relatively high preliminary score those that are most likely to interact, by considering the local topology of the network around them. For example, the ROC50 AUC rises from 0.044 to 0.075 when the Transitive module is added to the Expression-Orthology-Combined predictor, and the number of predictions above a posterior odds ratio of 1 doubles from 15330 to 34780. Once again, the Disorder module does not contribute positively to the prediction. Its inclusion does not increase any of the measures of accuracy considered. The predictor that considers the Expression, Orthology, Combined and Transitive modules is the one that achieves the highest accuracy overall. It is this predictor that is further analyzed in the next sections.
Comparison to predictions generated using alternative training sets
In this work, the training sets comprised 100 times more negatives than positives, with the negatives randomly selected and filtered to remove any known or suspected positives (see Methods). Other groups have used negative:positive ratios ranging from 1 to more than 600 (see for example [37, 47, 52]). In addition, several groups use localization-derived negatives (i.e. protein pairs that are not annotated as being localized to the same cellular compartment) rather than randomly chosen negatives (see for example [37, 43, 46]). These issues have been investigated previously.
Since the choice of negative training data may influence the method, the choice of different training sets in the context of the probabilistic predictor presented here was investigated to determine which type of training set offers the highest accuracy.
Influence of the negative:positive training set ratio on the prediction accuracy (table reporting ROC50 AUC (std) and ROC100 AUC (std) for each combination of neg:pos training ratio and neg:pos testing ratio)
The effect of localization-derived negatives rather than randomly chosen negatives was also investigated to see if it would increase the prediction accuracy. A criticism of randomly chosen negatives is that they will contain some true interactors. However, the set of interacting pairs in the full protein pair space is small and thus the contamination rate of randomly chosen negative datasets will in fact be very low. Contamination is probably below 1%, which is likely lower than the contamination rate of the positive dataset, as discussed previously. Localization-derived negatives, on the other hand, should be free of contamination if the localization annotations are complete and accurate, two conditions that are difficult to satisfy. However, one can argue that localization-derived negatives might not be able to capture the full diversity of the non-interacting protein space since many proteins in the same cellular compartment do not interact. In addition, proteins specific to a cellular compartment may have different characteristics to proteins in other compartments. Such predictors may not generalize well when predicting on cell-wide protein pairs, which consist not only of non-colocalized non-interacting pairs but also of numerous protein pairs that do not interact but are present in the same cellular compartment. These issues have been discussed previously. In order to see if different types of negatives could influence the accuracy of the predictors developed here, we generated negative training/test sets, as in earlier work, by identifying all pairs of human proteins for which one protein is annotated as being nuclear and the other is annotated as being localized to the plasma membrane in the HPRD database. The Combined module for these predictors only considers domains and PTMs but not subcellular localization, as this would result in using this feature both in the selection of the training set and as a feature predictive of interaction.
The localization-derived negative trained predictor tested on sets containing localization-derived negatives achieves a lower accuracy than that of the random negative trained predictor tested on a test set containing randomly-generated negatives (0.0686 +/- 0.0010 vs 0.0747 +/- 0.0022). This is most likely due to the fact that the localization-derived negative trained predictor cannot take full advantage of the Transitive module, since the network resulting from the predictions of the Group A modules likely does not sample the whole protein pair space well.
Our predictor trained with randomly generated negatives and a negative:positive ratio of 100 performs the best out of all the combinations of training sets investigated. It is this predictor that is further analyzed in subsequent sections.
Contribution of the modules
The relative contribution of the modules to the prediction of interaction was investigated in order to gain a better understanding of the predictive power and areas of highest usefulness of the different modules. To do this, all protein pairs were considered that achieve an estimated posterior odds ratio > 1 when the EOCT predictor was trained on the full datasets without cross-validation. This set consists of 37606 distinct predicted interactions and is referred to as the LR400 dataset (all these interactions are listed and ranked in Additional File 3). These protein pairs represent the most probable interactors with respect to the features considered, among all protein pairs examined by the predictor.
Comparison to other interaction datasets
In Figure 6B and 6C, we compare the number of distinct proteins and distinct interactions of the LR400 dataset to those of the Rhodes prediction dataset and the June 2006 version of the HPRD which was used to train our predictor. The Rhodes dataset was trained using an earlier version of the HPRD. As can be seen in Figure 6, the intersections between the three datasets considered are low, especially when comparing the interactions. Both the Rhodes dataset and our LR400 dataset predict interactions involving many proteins that are not even present in their positive training set (the HPRD). Many of the predictions in these two datasets concern protein pairs and proteins that are not present in other datasets, suggesting that they cover different regions of the human interaction space. As suggested previously, by making more such datasets available, it will be possible to increase our coverage of the interaction space and determine the most likely human interactions.
Another human interaction dataset has recently become available: the IntNetDB. It was generated by integrating seven different features (four of which involve transferring interactions or characteristics of protein pairs from model organisms to human) in a probabilistic framework. Interactions were predicted above a TP/FP ratio (number of true positives divided by the number of false positives in the test set) of 1. Using such a threshold, the authors claim to predict 180010 human interactions. We do not compare our predictions to this dataset because such a threshold of TP/FP > 1 does not correspond to a posterior odds threshold > 1. Depending on the positive-to-negative ratio used in the datasets, TP/FP > 1 might correspond to an average posterior odds ratio of 1. In contrast, the average posterior odds ratio of our LR400 dataset is above 700. In comparison, by using a threshold of TP/FP > 1 in our test set, we predict over 1000000 human interactions. We do not believe that the quality of this large number of predictions is high enough to warrant their publication since the great majority of these protein pairs achieve a posterior odds ratio below 1.
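To see why a TP/FP threshold and a posterior odds threshold differ, consider a test set with P positives and N negatives. For all pairs scoring above a cutoff, the likelihood ratio is (TP/P)/(FP/N) = (TP/FP) x (N/P), and the posterior odds are this value multiplied by the prior odds. The sketch below applies that relationship; it describes the average over all pairs above the cutoff, and the function name and example ratios are assumptions made for illustration rather than a description of how IntNetDB was scored.

```python
def posterior_odds_from_tp_fp(tp_fp, neg_pos_ratio, prior_odds):
    """Average posterior odds implied by a TP/FP ratio measured on a test set.

    tp_fp:         true positives divided by false positives above the cutoff
    neg_pos_ratio: negatives per positive in the test set (N / P)
    prior_odds:    assumed prior odds of interaction in the full pair space
    """
    likelihood_ratio = tp_fp * neg_pos_ratio
    return likelihood_ratio * prior_odds


# Test set with 100 negatives per positive and a prior odds ratio of 1/400.
odds_at_ratio_100 = posterior_odds_from_tp_fp(1.0, 100, 1 / 400)

# TP/FP equals the posterior odds only when N/P matches 1/prior (here 400).
odds_at_ratio_400 = posterior_odds_from_tp_fp(1.0, 400, 1 / 400)
```

Under these assumptions, TP/FP = 1 in a 100:1 test set corresponds to posterior odds of about 0.25, well below the threshold of 1 used for the LR400 dataset.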
TCPTP was predicted to interact with STAT6 at a posterior odds ratio of 4300. It has been recently reported that TCPTP, the only protein tyrosine phosphatase known to localize to the nucleus, dephosphorylates STAT6 in this cellular compartment, which may in turn lead to the suppression of Interleukin-4 (IL-4) induced signaling.
N-WASP and ARP3 achieve a predicted posterior odds ratio of interaction of 2700. A recent report suggested that the IQGAP1 protein can activate N-WASP, thus changing its conformation and allowing it to bind the ARP2/3 complex, which in turn directs the generation of branched actin filaments required for the extension of a lamellipodium.
The VAMP3-VTI1A interaction was predicted with a posterior odds ratio of 1518. Both these proteins are believed to be part of the SNARE (soluble N-ethylmaleimide-sensitive factor attachment protein receptor) family of proteins which are involved in membrane fusion events. VTI1A is a trans-Golgi-network-localized putative t-SNARE and VAMP3 is an early/recycling endosomal v-SNARE. These two proteins were recently shown to interact, leading to their functional implication in the post-Golgi retrograde transport step.
CDK2 and MCM4 were predicted to interact at a posterior odds ratio of 62. CDK2 has recently been shown to phosphorylate MCM4, a subunit of a putative replicative helicase essential for DNA replication, on two distinct residues, leading to a change in its affinity to chromatin and its enrichment in the nucleolus.
Sam68 and Smad2 achieve a predicted posterior odds ratio of 32. This interaction has been experimentally demonstrated by large-scale yeast-two-hybrid analysis of the Smad signaling system.
Our probabilistic predictor therefore not only reproduces and completes well-known protein complexes but also identifies novel interactions, a subset of which have been independently validated.
The current human protein interaction map is estimated to be only 10% complete. Here, we investigated the prediction of human protein-protein interactions in an effort to increase the coverage of the human interactome while simultaneously providing high quality predictions. By considering several different types of orthogonal and quite distinct features including expression, orthology, combined protein characteristics and local network topology, we predicted over 37000 human protein interactions and explored a subspace of the human interactome that has not been investigated by previous large interaction datasets. Our investigation led us to compare the influence of different training sets on the prediction accuracy. The use of randomly generated negative training examples and large negative-to-positive ratios in the training set generated the most accurate predictors in the context of our model. A comparison to other large human interaction datasets revealed the average false positive rate of our dataset to be 76%, which is much lower than the overall average for most large scale, currently available, human interaction datasets (experimental and computational), estimated to be 90%. A subset of our novel predictions has been independently validated by identifying recent reports that experimentally investigated and confirmed that these protein pairs do interact. We provide all our predictions ranked according to the posterior odds ratio of interaction in Additional File 3. It is thus possible to restrict the dataset to the highest scoring protein pairs (and only choose, for example, protein pairs that have an estimated true positive rate of interaction above 90%). By making this human interaction prediction dataset publicly available, it is our hope that it will help to identify the highest-confidence interactions, leading to a more complete and accurate human interaction map.
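Restricting the ranked list to a chosen estimated true positive rate amounts to converting that rate into a posterior odds cutoff, since a posterior odds ratio of 1 corresponds to 50% and a rate of 90% corresponds to odds of 9. The following sketch shows the conversion using the five validated examples quoted in the Results; the function names are illustrative, and the ranked file itself (Additional File 3) is not read here.

```python
def odds_to_probability(odds):
    """Estimated true positive rate implied by a posterior odds ratio."""
    return odds / (1.0 + odds)


def probability_to_odds(probability):
    """Posterior odds ratio implied by an estimated true positive rate (< 1)."""
    return probability / (1.0 - probability)


# Posterior odds ratios quoted in the text for independently validated pairs.
VALIDATED = [
    ("TCPTP", "STAT6", 4300),
    ("N-WASP", "ARP3", 2700),
    ("VAMP3", "VTI1A", 1518),
    ("CDK2", "MCM4", 62),
    ("Sam68", "Smad2", 32),
]


def restrict(predictions, min_probability):
    """Keep predictions whose posterior odds exceed the implied cutoff."""
    cutoff = probability_to_odds(min_probability)
    return [p for p in predictions if p[2] > cutoff]


above_90 = restrict(VALIDATED, 0.90)  # cutoff is about 9: all five pairs remain
above_99 = restrict(VALIDATED, 0.99)  # cutoff is about 99: three pairs remain
```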
In order to investigate the likelihood of interaction of human proteins, 62322 human protein sequences were downloaded from the International Protein Index (IPI) database (version 3.16). Some of these proteins are alternative transcripts of the same gene but can have distinct interaction partners. Known interactions were downloaded from the Human Protein Reference Database (HPRD; June 2006 version). Duplicate interactions and self-interactions were not considered. Additionally, some proteins were not recovered in the conversion between different identifiers. This resulted in 26896 distinct human protein interactions involving 7531 distinct human proteins present in the initial IPI dataset. The 26896 interactions from the June 2006 version of the HPRD were used as the positive dataset in the training/testing of the predictor. Two different sets of non-interacting protein pairs were investigated: the main analysis employed a randomly-generated negative dataset but this was also compared to a localization-derived negative dataset. Both non-interacting protein datasets were cleaned by removing all protein pairs that came from the positive dataset as well as protein pairs that were annotated as interacting in other databases (DIP: 679 interactions, BIND: 2650 interactions), or predicted to interact in other studies (OPHID: 21815 interactions).
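The cleaning steps described above (dropping duplicates and self-interactions, then removing known or predicted interactions from the negatives) can be sketched with unordered pairs represented as frozensets. The function names and protein identifiers are hypothetical, and identifier conversion between databases is not covered.

```python
def normalize_pairs(raw_pairs):
    """Collapse duplicates (A-B equals B-A) and drop self-interactions."""
    return {frozenset(pair) for pair in raw_pairs if pair[0] != pair[1]}


def clean_negatives(candidates, *known_sets):
    """Remove from the candidate negatives every pair found in any known set."""
    known = set().union(*known_sets)
    return normalize_pairs(candidates) - known


# Toy identifiers, for illustration only.
toy_positives = normalize_pairs(
    [("A", "B"), ("B", "A"), ("C", "C"), ("A", "D")]
)
toy_other_db = normalize_pairs([("G", "H")])
toy_negatives = clean_negatives(
    [("A", "B"), ("E", "F"), ("F", "E"), ("G", "H")],
    toy_positives,
    toy_other_db,
)
```

In this toy case only the E-F pair survives, since A-B is a known positive and G-H is annotated in another database.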
Of the 62322 human proteins from the initial IPI dataset, 22889 were characterized by at least one of the features that we considered to predict interaction (see the Features section). These 22889 human proteins are encoded by 16904 distinct genes and are referred to as the Informative Protein Set. The randomly-generated negative dataset used for the training and testing of the predictor was created by selecting protein pairs at random from the Informative Protein Set. In contrast, the localization-derived negative dataset was created by selecting protein pairs from the Informative Protein Set for which the HPRD  annotates one as being primarily in the plasma membrane and the other as primarily in the nucleus. Training and testing was performed with 5-fold cross-validation. In addition, positive to negative ratios of 1:1 and 1:100 were considered.
The predictions were compared to the literature-mined Ramani dataset , the orthology-derived Lehner prediction dataset  and the probabilistic Rhodes prediction dataset . All three datasets identify the interactions by stating the names and/or gene locus IDs of the genes that encode the interacting proteins. In contrast, we work directly on the protein sequences and so related the gene annotations to our protein identifiers by extracting Entrez Gene IDs corresponding to the IPI protein entries from the IPI cross-reference files (for the IPI release 3.24) . Ensembl gene identifiers (Ensembl 42) were also matched to Entrez Locus IDs (NCBI36) using BioMart .
Some gene-gene entries were not recovered in the conversion between different identifiers, or due to the deletion or replacement of some Entrez Locus IDs. Despite this, 37714 gene-gene interactions were recovered from the Rhodes dataset and 6132 interactions from the Ramani dataset as well as 64306 and 10454 interactions from the Lehner full and core datasets respectively.
Semi-naïve Bayes classifiers were used to measure the likelihood of interaction of two proteins given the presence of the features considered. This learning method was chosen because it allows the integration of highly heterogeneous data in a model that is easy to interpret and that can readily accommodate missing data. The transparency of the method allows the straightforward determination of which features are most predictive of interaction at the level of the whole proteome as well as for individual protein pairs.
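The posterior odds ratio of interaction, Opost, was calculated as

$$O_{post} = \frac{P(I \mid f_1, \ldots, f_n)}{P(\sim I \mid f_1, \ldots, f_n)} = O_{prior} \times LR(f_1, \ldots, f_n)$$

where I is a binary variable representing interaction, ~I represents non-interaction, f1 through fn are the features we are considering, Oprior is the prior odds ratio and LR is the likelihood ratio.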
The likelihood ratios for the different features considered can be estimated by evaluating the ratio of the proportion of interacting and non-interacting proteins for which a particular state of the feature is true in the training set (i.e. by determining to which bin of the feature the protein pair belongs, for every protein pair in the positive and negative training sets). More precisely, the training step consisted of calculating the respective proportions of positive and negative examples that fall into each bin of the feature(s) considered (i.e. that have a particular state). The likelihood ratio of interaction for a given state is simply the ratio of the proportion of all positives that have that state divided by the proportion of all negatives that have that same state. When a particular state of a feature occurs only in positive examples (known interacting proteins), the likelihoods are set to the highest non-infinite value of any state for that feature (to avoid infinite values). Additionally, when no data are available for a specific feature (for a given pair of proteins), the likelihood of the feature is set to 1.0. For a detailed calculation of the likelihoods see Additional File 4.
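The training step described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names, the toy bin labels and the treatment of states never seen in training (given a ratio of 1.0, like missing data) are assumptions made for the example.

```python
from collections import Counter


def likelihood_ratios(pos_states, neg_states):
    """Estimate one likelihood ratio per state (bin) of a single feature.

    pos_states / neg_states list the bin that each positive (interacting)
    and negative (non-interacting) training pair falls into.
    """
    pos_counts, neg_counts = Counter(pos_states), Counter(neg_states)
    n_pos, n_neg = len(pos_states), len(neg_states)
    ratios, positive_only = {}, []
    for state in set(pos_counts) | set(neg_counts):
        if neg_counts[state] == 0:
            positive_only.append(state)  # the ratio would be infinite
        else:
            # (pos / n_pos) / (neg / n_neg), arranged to limit rounding error
            ratios[state] = (pos_counts[state] * n_neg) / (neg_counts[state] * n_pos)
    # States seen only in positives get the highest finite ratio of the feature.
    # Assumption: fall back to 1.0 if the feature has no finite ratio at all.
    cap = max(ratios.values(), default=1.0)
    for state in positive_only:
        ratios[state] = cap
    return ratios


def posterior_odds(prior_odds, ratios_by_module, observed_states):
    """Multiply the prior odds by the likelihood ratio of each module."""
    odds = prior_odds
    for module, ratios in ratios_by_module.items():
        state = observed_states.get(module)  # None means no data for this pair
        # Missing data contributes 1.0; unseen states also get 1.0 (assumption).
        odds *= 1.0 if state is None else ratios.get(state, 1.0)
    return odds


# Toy training data: 10 positive and 10 negative pairs for one feature.
expression = likelihood_ratios(
    ["high"] * 6 + ["low"] * 3 + ["top"],
    ["high"] * 2 + ["low"] * 8,
)
odds = posterior_odds(1 / 400, {"expression": expression}, {"expression": "high"})
```

In this toy example the "high" bin holds 60% of positives and 20% of negatives, giving a likelihood ratio of 3, and the "top" bin, which occurs only among positives, inherits that same highest finite value.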
Prior odds ratio estimate
The prior odds ratio (Oprior) is difficult to estimate because we do not know all the true interactions, even for a small subset of proteins. The prior odds ratio of interaction for yeast was estimated by combining all protein-protein interactions (but only those related to direct physical interactions, and no entries derived by synthetic lethal-type experiments) from the BIND, DIP and GRID databases [65, 66, 69]. This subset of interactions contains 36466 distinct interactions involving 5202 distinct proteins, thus resulting in a prior odds ratio of 1/370. This is most likely a conservative estimate since a certain proportion of interactions remain unknown, and so the prior odds ratio will increase as more data become available. For human proteins, 12191 distinct interactions involving 5164 human proteins were recovered from the September 2005 version of the HPRD and 26896 distinct interactions involving 7531 human proteins from the June 2006 version, leading respectively to prior odds estimates of 1/1093 and 1/1053. However, taking the subset of 5164 proteins from the September 2005 version that are seen in the June 2006 version (20842 distinct interactions) gave a prior odds estimate of 1/639. Thus, between the two releases of the HPRD, there was a large increase in the number of interactions for this subset of proteins, and this is likely to continue for at least the next few releases. Accordingly, it is reasonable to conclude that there are not enough known human interactions to calculate a realistic and stable estimate of the prior odds ratio of interaction for human. As a consequence, a prior odds ratio of 1/400 was used for all work in the paper, which is similar to the estimate for yeast and is likely still an underestimate of the true value.
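The prior odds figures quoted above can be checked with a short calculation. The formula used here is an assumption (known interacting pairs divided by all remaining unordered pairs among the same proteins); it is not stated explicitly in the text, but it reproduces the four reported values.

```python
def prior_odds(n_interactions, n_proteins):
    """Odds that a randomly chosen pair of distinct proteins interacts."""
    n_pairs = n_proteins * (n_proteins - 1) // 2  # unordered pairs, no self-pairs
    return n_interactions / (n_pairs - n_interactions)


datasets = {
    "yeast (BIND, DIP, GRID)": (36466, 5202),
    "human, HPRD September 2005": (12191, 5164),
    "human, HPRD June 2006": (26896, 7531),
    "human, 2005 proteins in the 2006 release": (20842, 5164),
}
for name, (n_int, n_prot) in datasets.items():
    print(f"{name}: 1/{round(1 / prior_odds(n_int, n_prot))}")
```

Run as written, the loop prints 1/370, 1/1093, 1/1053 and 1/639 in that order.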
Seven distinct features combined into five modules were investigated as summarized in Table 1 and described below.
1. Expression module
Expression data were downloaded from the Gene Expression Omnibus. The GDS596 dataset was used, which examines gene expression profiles from 79 physiologically normal tissues obtained from various sources. Expression data were recovered for 10642 distinct transcripts in 158 different arrays (2 arrays per tissue). Pearson correlations were calculated for all 56620761 transcript pairs and correlation values were grouped into 20 bins of increasing co-expression.
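As a rough illustration of the expression module, the sketch below computes a Pearson correlation between two expression profiles and assigns it to one of 20 bins. The equal-width binning over [-1, 1] and the function names are assumptions made for this example; the text specifies only that 20 bins of increasing co-expression were used.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def coexpression_bin(r, n_bins=20):
    """Map a correlation in [-1, 1] to a bin index 0..n_bins-1 (equal widths)."""
    index = int((r + 1.0) / 2.0 * n_bins)
    return min(max(index, 0), n_bins - 1)  # clamp r = 1 and rounding overshoot


# Number of unordered transcript pairs for 10642 transcripts.
n_transcript_pairs = 10642 * 10641 // 2

profile_a = [1.0, 2.0, 3.0, 4.0]
profile_b = [2.0, 4.0, 6.0, 8.0]
top_bin = coexpression_bin(pearson(profile_a, profile_b))
```

The pair count evaluates to 56620761, matching the number of transcript pairs given above.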
2. Orthology module
Orthology maps between human and yeast, worm and fly were downloaded from the InParanoid database. Interaction datasets for model organisms were downloaded from the BIND, DIP and GRID databases. Orthology interaction data were classified into 13 bins. High, medium and low confidence bins were defined for human protein pairs that have interacting orthologs in yeast, fly or worm (for a total of 9 bins). The high confidence bins were populated by human protein pairs that have interacting orthologs that both achieve an InParanoid score of 1 (i.e. both proteins involved in an interaction in another organism are respectively the best orthology match for the two human proteins under consideration). The medium confidence bins were populated by human protein pairs that have interacting orthologs but only one of the interacting orthologs has an InParanoid score of 1. The low confidence bins were filled by human protein pairs that have interacting orthologs according to InParanoid but neither achieves a score of 1 (i.e. neither is the best match for the two human proteins under consideration). The orthology module has four additional bins: two bins for human pairs that have interacting paralogs in human (a medium and a low confidence bin, which use the same definitions as above for the model organisms), one bin for human pairs that have interacting homologs in more than one organism (these can be orthologs in yeast, worm or fly, or paralogs in human) and one bin for human pairs that have only non-interacting orthologs.
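The nine model-organism bins can be illustrated with a small helper. This sketch covers only the high/medium/low assignment from the two InParanoid scores; the bin labels are hypothetical, and the paralog, multi-organism and non-interacting bins (whose precedence rules are not spelled out above) are deliberately left out.

```python
MODEL_ORGANISMS = ("yeast", "worm", "fly")


def confidence(score_a, score_b):
    """Confidence level from the InParanoid scores of the two orthologs."""
    best_matches = int(score_a == 1.0) + int(score_b == 1.0)
    return ("low", "medium", "high")[best_matches]


def orthology_bin(organism, score_a, score_b):
    """Label of the bin for a human pair with interacting orthologs."""
    if organism not in MODEL_ORGANISMS:
        raise ValueError(f"unexpected organism: {organism}")
    return f"{organism}-{confidence(score_a, score_b)}"


all_bins = {
    orthology_bin(org, a, b)
    for org in MODEL_ORGANISMS
    for a, b in [(1.0, 1.0), (1.0, 0.5), (0.5, 0.5)]
}
```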
3. Combined module
This module incorporates three distinct features in a non-naïve Bayesian framework: subcellular localization, domain co-occurrence and post-translational modification co-occurrence.
PSLT (Protein Subcellular Localization Tool) subcellular localization predictions were used to classify protein pairs into one of four groups: pairs of proteins predicted to be in the same compartment, pairs of proteins predicted to be in neighboring compartments (cytosol-nucleus, endoplasmic reticulum-Golgi, Golgi-cytosol, cytosol-plasma membrane, and plasma membrane-secreted), pairs of proteins predicted to be in different non-neighboring compartments and pairs of proteins for which there were no localization predictions. Neighboring compartments were chosen as compartment pairs sharing a high proportion of proteins, as investigated previously.
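The four localization groups translate directly into a lookup. In the sketch below, the compartment strings, the group labels, the use of a single predicted compartment per protein and the choice to return "no prediction" when either protein lacks a prediction are all assumptions made for illustration.

```python
NEIGHBORING = {
    frozenset(pair)
    for pair in [
        ("cytosol", "nucleus"),
        ("endoplasmic reticulum", "Golgi"),
        ("Golgi", "cytosol"),
        ("cytosol", "plasma membrane"),
        ("plasma membrane", "secreted"),
    ]
}


def localization_group(loc_a, loc_b):
    """Classify a protein pair by the predicted compartments of its members."""
    if loc_a is None or loc_b is None:
        return "no prediction"
    if loc_a == loc_b:
        return "same"
    if frozenset((loc_a, loc_b)) in NEIGHBORING:
        return "neighboring"
    return "non-neighboring"
```

Using `frozenset` makes the neighbor test independent of the order in which the two proteins are given.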
Co-occurrence of domains
The chi-square test was used as a measure of the likelihood of co-occurrence of specific InterPro domains and motifs in protein pairs. Chi-square scores were calculated for all pairs of domains/motifs that occurred in the training data and were then grouped into 5 bins of increasing value. Additionally, Pfam domain pairs known to interact from three-dimensional structures were included in the highest chi-square score bin. When protein pairs contained more than one domain pair, the domain pair assigned to the highest chi-square score bin was used to assign a likelihood of interaction.
Post-translational modification (PTM) pair co-occurrence
where PTM[i] and PTM[j] are distinct PTMs and I is the set of all interacting proteins that were used to train the predictor. The annotations of PTMs in human proteins were downloaded from UniProt and HPRD. PTM instances described as "predicted", "probable" or "possible" were excluded, leaving 3439 distinct proteins with PTM annotations in the training set. The PTM pair enrichment scores were grouped into 4 bins of increasing value.
The localization, co-occurrence of domains, and PTMs were considered simultaneously to measure their predictive power in assessing the likelihood of protein interaction. To do this, all possible combinations of the 4 localization bins, 5 chi-square domain-co-occurrence bins and 4 PTM_score bins were investigated and are referred to as the combined module.
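To make the difference from a naive combination concrete, the sketch below enumerates the joint states of the combined module and estimates one likelihood ratio per joint state instead of multiplying three per-feature ratios. The integer bin indices, function name and toy counts are assumptions for illustration; the rule for states seen only in positives is omitted for brevity.

```python
from collections import Counter
from itertools import product

# One joint state per (localization bin, domain bin, PTM bin) combination.
COMBINED_STATES = list(product(range(4), range(5), range(4)))


def joint_likelihood_ratios(pos_states, neg_states):
    """Likelihood ratio per joint state observed in the negative training set."""
    pos_counts, neg_counts = Counter(pos_states), Counter(neg_states)
    n_pos, n_neg = len(pos_states), len(neg_states)
    return {
        state: (pos_counts[state] * n_neg) / (neg_counts[state] * n_pos)
        for state in neg_counts
    }


# Toy data: each training pair is described by one (loc, domain, ptm) tuple.
positives = [(0, 4, 3)] * 8 + [(2, 0, 0)] * 2
negatives = [(0, 4, 3)] * 1 + [(2, 0, 0)] * 9
ratios = joint_likelihood_ratios(positives, negatives)
```

With 4, 5 and 4 bins the module has 80 possible joint states, so each state needs enough training pairs to give a stable estimate.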
4. Disorder module
It has been suggested that unstructured regions of proteins are often involved in binding interactions, particularly in the case of transient interactions. Protein intrinsic disorder was predicted for all proteins considered by using the VSL2B predictor. The disorder score for protein pairs was then calculated as the sum of percent disorder for each protein of the pair. Disorder scores were grouped into 6 bins of increasing value.
The Expression, Orthology, Combined and Disorder modules are referred to collectively as the Group A modules. Likelihood ratios for each of the Group A modules are illustrated in Figure 1A (see Additional File 4 for complete likelihood ratios for every possible state of these modules and for detailed calculations of these likelihood ratios).
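The pair disorder score can be sketched as follows. The per-residue input, the 0.5 cut-off for calling a residue disordered and the equal-width binning over 0 to 200 are assumptions made for this example; the text states only that percent disorder is summed over the pair and grouped into 6 bins.

```python
def percent_disorder(residue_scores, threshold=0.5):
    """Percentage of residues whose disorder score reaches the threshold."""
    disordered = sum(1 for score in residue_scores if score >= threshold)
    return 100.0 * disordered / len(residue_scores)


def pair_disorder_score(scores_a, scores_b):
    """Sum of percent disorder of the two proteins (range 0 to 200)."""
    return percent_disorder(scores_a) + percent_disorder(scores_b)


def disorder_bin(score, n_bins=6, max_score=200.0):
    """Equal-width bin index 0..n_bins-1 for a pair disorder score."""
    return min(int(score / max_score * n_bins), n_bins - 1)


pair_score = pair_disorder_score([0.9, 0.1, 0.6, 0.2], [0.9, 0.9])
```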
5. Transitive module
where Ec is the set of edges that connect proteins i and j to their common interactors, Ei is the set of edges that involve protein i, se is the score (likelihood ratio) of edge e and Ei\Ec refers to the set difference of Ei and Ec. For a given set of neighbors, T increases as the interactions with these neighbors become more likely (as the sum of se increases). Additionally, the topology score T of a pair of proteins increases as the proportion of likely interactors that these two proteins share increases. The topology scores were grouped into 5 bins of increasing value. It should be noted that the neighborhood topology score calculated for a given protein pair does not consider the preliminary score assigned to that protein pair. It only considers the preliminary scores of its neighbors and so is truly based on the local network topology around that protein pair. Accordingly, the likelihood ratio the transitive module outputs for a given protein pair is independent of the likelihood ratio calculated by the Group A modules for this same protein pair.
The Pearson correlation between pairs of modules was estimated by taking 150 samples of 10000 protein pairs each and calculating the Pearson correlation of the likelihood ratios for the two modules considered, for each sample. The reported correlation values are the average of the 150 experiments. Samples of the protein pair space were taken instead of considering the whole space as this was more computationally tractable.
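The sampling procedure can be sketched as below. Sampling without replacement, the fixed seed and the function names are assumptions; the defaults mirror the 150 samples of 10000 protein pairs described above.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def mean_sampled_correlation(lr_a, lr_b, n_samples=150, sample_size=10000, seed=0):
    """Average Pearson correlation of two modules over random pair samples.

    lr_a[i] and lr_b[i] are the likelihood ratios that the two modules
    assign to protein pair i.
    """
    rng = random.Random(seed)
    indices = range(len(lr_a))
    total = 0.0
    for _ in range(n_samples):
        chosen = rng.sample(indices, sample_size)  # without replacement
        total += pearson([lr_a[i] for i in chosen], [lr_b[i] for i in chosen])
    return total / n_samples


# Toy check with two perfectly correlated modules.
module_a = [float(i) for i in range(50)]
module_b = [2.0 * value + 1.0 for value in module_a]
toy_correlation = mean_sampled_correlation(module_a, module_b, n_samples=5, sample_size=20)
```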
where i takes on values from 1 to n, T is the total number of positives in the test set and Ti is the number of positives that score higher than the ith highest scoring negative.
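The quantities defined above (n, T and Ti) match the ingredients of the standard ROC-n score, in which the Ti values for the n highest scoring negatives are summed and divided by nT. The sketch below assumes that this is the measure intended; the function name and the input format are hypothetical.

```python
def roc_n(scored_pairs, n):
    """ROC-n score: (1 / (n * T)) * sum of T_i for i = 1..n.

    scored_pairs is a list of (score, is_positive) tuples from the test set.
    """
    positives = [score for score, is_positive in scored_pairs if is_positive]
    negatives = sorted(
        (score for score, is_positive in scored_pairs if not is_positive),
        reverse=True,
    )
    if len(negatives) < n or not positives:
        raise ValueError("need at least n negatives and one positive")
    total = 0
    for negative_score in negatives[:n]:
        # T_i: positives scoring higher than the i-th highest scoring negative
        total += sum(1 for score in positives if score > negative_score)
    return total / (n * len(positives))


test_set = [(0.9, True), (0.8, True), (0.7, False), (0.6, True), (0.5, False)]
score = roc_n(test_set, n=2)
```

In the toy test set, two positives outrank the highest scoring negative and three outrank the second, so the score is 5/6; a predictor that ranks every positive above the top n negatives scores 1.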
We would like to thank Dr Tom Walsh for technical support as well as Drs James Procter and Emily Jefferson for helpful discussions. MSS is a recipient of a post-doctoral fellowship from the Canadian Institutes of Health Research (CIHR).
- Xia Y, Yu H, Jansen R, Seringhaus M, Baxter S, Greenbaum D, Zhao H, Gerstein M: Analyzing cellular biochemistry in terms of molecular networks. Annu Rev Biochem 2004, 73: 1051–1087. 10.1146/annurev.biochem.73.011303.073950
- Gavin AC, Bosche M, Krause R, Grandi P, Marzioch M, Bauer A, Schultz J, Rick JM, Michon AM, Cruciat CM, Remor M, Hofert C, Schelder M, Brajenovic M, Ruffner H, Merino A, Klein K, Hudak M, Dickson D, Rudi T, Gnau V, Bauch A, Bastuck S, Huhse B, Leutwein C, Heurtier MA, Copley RR, Edelmann A, Querfurth E, Rybin V, Drewes G, Raida M, Bouwmeester T, Bork P, Seraphin B, Kuster B, Neubauer G, Superti-Furga G: Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature 2002, 415(6868):141–147. 10.1038/415141a
- Ho Y, Gruhler A, Heilbut A, Bader GD, Moore L, Adams SL, Millar A, Taylor P, Bennett K, Boutilier K, Yang L, Wolting C, Donaldson I, Schandorff S, Shewnarane J, Vo M, Taggart J, Goudreault M, Muskat B, Alfarano C, Dewar D, Lin Z, Michalickova K, Willems AR, Sassi H, Nielsen PA, Rasmussen KJ, Andersen JR, Johansen LE, Hansen LH, Jespersen H, Podtelejnikov A, Nielsen E, Crawford J, Poulsen V, Sorensen BD, Matthiesen J, Hendrickson RC, Gleeson F, Pawson T, Moran MF, Durocher D, Mann M, Hogue CW, Figeys D, Tyers M: Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature 2002, 415(6868):180–183. 10.1038/415180a
- Ito T, Chiba T, Ozawa R, Yoshida M, Hattori M, Sakaki Y: A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc Natl Acad Sci U S A 2001, 98(8):4569–4574. 10.1073/pnas.061034498
- Krogan NJ, Cagney G, Yu H, Zhong G, Guo X, Ignatchenko A, Li J, Pu S, Datta N, Tikuisis AP, Punna T, Peregrin-Alvarez JM, Shales M, Zhang X, Davey M, Robinson MD, Paccanaro A, Bray JE, Sheung A, Beattie B, Richards DP, Canadien V, Lalev A, Mena F, Wong P, Starostine A, Canete MM, Vlasblom J, Wu S, Orsi C, Collins SR, Chandran S, Haw R, Rilstone JJ, Gandi K, Thompson NJ, Musso G, St Onge P, Ghanny S, Lam MH, Butland G, Altaf-Ul AM, Kanaya S, Shilatifard A, O'Shea E, Weissman JS, Ingles CJ, Hughes TR, Parkinson J, Gerstein M, Wodak SJ, Emili A, Greenblatt JF: Global landscape of protein complexes in the yeast Saccharomyces cerevisiae. Nature 2006, 440(7084):637–643. 10.1038/nature04670
- Uetz P, Giot L, Cagney G, Mansfield TA, Judson RS, Knight JR, Lockshon D, Narayan V, Srinivasan M, Pochart P, Qureshi-Emili A, Li Y, Godwin B, Conover D, Kalbfleisch T, Vijayadamodar G, Yang M, Johnston M, Fields S, Rothberg JM: A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature 2000, 403(6770):623–627. 10.1038/35001009
- Li S, Armstrong CM, Bertin N, Ge H, Milstein S, Boxem M, Vidalain PO, Han JD, Chesneau A, Hao T, Goldberg DS, Li N, Martinez M, Rual JF, Lamesch P, Xu L, Tewari M, Wong SL, Zhang LV, Berriz GF, Jacotot L, Vaglio P, Reboul J, Hirozane-Kishikawa T, Li Q, Gabel HW, Elewa A, Baumgartner B, Rose DJ, Yu H, Bosak S, Sequerra R, Fraser A, Mango SE, Saxton WM, Strome S, Van Den Heuvel S, Piano F, Vandenhaute J, Sardet C, Gerstein M, Doucette-Stamm L, Gunsalus KC, Harper JW, Cusick ME, Roth FP, Hill DE, Vidal M: A map of the interactome network of the metazoan C. elegans. Science 2004, 303(5657):540–543. 10.1126/science.1091403
- Formstecher E, Aresta S, Collura V, Hamburger A, Meil A, Trehin A, Reverdy C, Betin V, Maire S, Brun C, Jacq B, Arpin M, Bellaiche Y, Bellusci S, Benaroch P, Bornens M, Chanet R, Chavrier P, Delattre O, Doye V, Fehon R, Faye G, Galli T, Girault JA, Goud B, de Gunzburg J, Johannes L, Junier MP, Mirouse V, Mukherjee A, Papadopoulo D, Perez F, Plessis A, Rosse C, Saule S, Stoppa-Lyonnet D, Vincent A, White M, Legrain P, Wojcik J, Camonis J, Daviet L: Protein interaction mapping: a Drosophila case study. Genome Res 2005, 15(3):376–384. 10.1101/gr.2659105
- Giot L, Bader JS, Brouwer C, Chaudhuri A, Kuang B, Li Y, Hao YL, Ooi CE, Godwin B, Vitols E, Vijayadamodar G, Pochart P, Machineni H, Welsh M, Kong Y, Zerhusen B, Malcolm R, Varrone Z, Collis A, Minto M, Burgess S, McDaniel L, Stimpson E, Spriggs F, Williams J, Neurath K, Ioime N, Agee M, Voss E, Furtak K, Renzulli R, Aanensen N, Carrolla S, Bickelhaupt E, Lazovatsky Y, DaSilva A, Zhong J, Stanyon CA, Finley RL Jr., White KP, Braverman M, Jarvie T, Gold S, Leach M, Knight J, Shimkets RA, McKenna MP, Chant J, Rothberg JM: A protein interaction map of Drosophila melanogaster. Science 2003, 302(5651):1727–1736. 10.1126/science.1090289
- Butland G, Peregrin-Alvarez JM, Li J, Yang W, Yang X, Canadien V, Starostine A, Richards D, Beattie B, Krogan N, Davey M, Parkinson J, Greenblatt J, Emili A: Interaction network containing conserved and essential protein complexes in Escherichia coli. Nature 2005, 433(7025):531–537. 10.1038/nature03239
- Rual JF, Venkatesan K, Hao T, Hirozane-Kishikawa T, Dricot A, Li N, Berriz GF, Gibbons FD, Dreze M, Ayivi-Guedehoussou N, Klitgord N, Simon C, Boxem M, Milstein S, Rosenberg J, Goldberg DS, Zhang LV, Wong SL, Franklin G, Li S, Albala JS, Lim J, Fraughton C, Llamosas E, Cevik S, Bex C, Lamesch P, Sikorski RS, Vandenhaute J, Zoghbi HY, Smolyar A, Bosak S, Sequerra R, Doucette-Stamm L, Cusick ME, Hill DE, Roth FP, Vidal M: Towards a proteome-scale map of the human protein-protein interaction network. Nature 2005, 437(7062):1173–1178. 10.1038/nature04209
- Stelzl U, Worm U, Lalowski M, Haenig C, Brembeck FH, Goehler H, Stroedicke M, Zenkner M, Schoenherr A, Koeppen S, Timm J, Mintzlaff S, Abraham C, Bock N, Kietzmann S, Goedde A, Toksoz E, Droege A, Krobitsch S, Korn B, Birchmeier W, Lehrach H, Wanker EE: A human protein-protein interaction network: a resource for annotating the proteome. Cell 2005, 122(6):957–968. 10.1016/j.cell.2005.08.029
- Sprinzak E, Sattath S, Margalit H: How reliable are experimental protein-protein interaction data? J Mol Biol 2003, 327(5):919–923. 10.1016/S0022-2836(03)00239-0
- von Mering C, Krause R, Snel B, Cornell M, Oliver SG, Fields S, Bork P: Comparative assessment of large-scale data sets of protein-protein interactions. Nature 2002, 417(6887):399–403. 10.1038/nature750
- Mishra GR, Suresh M, Kumaran K, Kannabiran N, Suresh S, Bala P, Shivakumar K, Anuradha N, Reddy R, Raghavan TM, Menon S, Hanumanthu G, Gupta M, Upendran S, Gupta S, Mahesh M, Jacob B, Mathew P, Chatterjee P, Arun KS, Sharma S, Chandrika KN, Deshpande N, Palvankar K, Raghavnath R, Krishnakanth R, Karathia H, Rekha B, Nayak R, Vishnupriya G, Kumar HG, Nagini M, Kumar GS, Jose R, Deepthi P, Mohan SS, Gandhi TK, Harsha HC, Deshpande KS, Sarker M, Prasad TS, Pandey A: Human protein reference database--2006 update. Nucleic Acids Res 2006, 34(Database issue):D411–4. 10.1093/nar/gkj141
- Ramani AK, Bunescu RC, Mooney RJ, Marcotte EM: Consolidating the set of known human protein-protein interactions in preparation for large-scale mapping of the human interactome. Genome Biol 2005, 6(5):R40. 10.1186/gb-2005-6-5-r40
- Reguly T, Breitkreutz A, Boucher L, Breitkreutz BJ, Hon GC, Myers CL, Parsons A, Friesen H, Oughtred R, Tong A, Stark C, Ho Y, Botstein D, Andrews B, Boone C, Troyanskaya OG, Ideker T, Dolinski K, Batada NN, Tyers M: Comprehensive curation and analysis of global interaction networks in Saccharomyces cerevisiae. J Biol 2006, 5(4):11. 10.1186/jbiol36
- Hart GT, Ramani AK, Marcotte EM: How complete are current yeast and human protein-interaction networks? Genome Biol 2006, 7(11):120. 10.1186/gb-2006-7-11-120
- Martin S, Roe D, Faulon JL: Predicting protein-protein interactions using signature products. Bioinformatics 2005, 21(2):218–226. 10.1093/bioinformatics/bth483
- Chinnasamy A, Mittal A, Sung WK: Probabilistic prediction of protein-protein interactions from the protein sequences. Comput Biol Med 2006, 36(10):1143–1154. 10.1016/j.compbiomed.2005.09.005
- Bock JR, Gough DA: Predicting protein--protein interactions from primary structure. Bioinformatics 2001, 17(5):455–460. 10.1093/bioinformatics/17.5.455
- Park J, Lappe M, Teichmann SA: Mapping protein family interactions: intramolecular and intermolecular protein family interaction repertoires in the PDB and yeast. J Mol Biol 2001, 307(3):929–938. 10.1006/jmbi.2001.4526
- Sprinzak E, Margalit H: Correlated sequence-signatures as markers of protein-protein interaction. J Mol Biol 2001, 311(4):681–692. 10.1006/jmbi.2001.4920
- Szilagyi A, Grimm V, Arakaki AK, Skolnick J: Prediction of physical protein-protein interactions. Phys Biol 2005, 2(1–2):S1-S16. 10.1088/1478-3975/2/2/S01
- Aloy P, Bottcher B, Ceulemans H, Leutwein C, Mellwig C, Fischer S, Gavin AC, Bork P, Superti-Furga G, Serrano L, Russell RB: Structure-based assembly of protein complexes in yeast. Science 2004, 303(5666):2026–2029. 10.1126/science.1092645
- Huynen M, Snel B, Lathe W 3rd, Bork P: Predicting protein function by genomic context: quantitative evaluation and qualitative inferences. Genome Res 2000, 10(8):1204–1210. 10.1101/gr.10.8.1204
- Gaasterland T, Ragan MA: Microbial genescapes: phyletic and functional patterns of ORF distribution among prokaryotes. Microb Comp Genomics 1998, 3(4):199–217.
- Goh CS, Bogan AA, Joachimiak M, Walther D, Cohen FE: Co-evolution of proteins with their interaction partners. J Mol Biol 2000, 299(2):283–293. 10.1006/jmbi.2000.3732
- Pazos F, Valencia A: Similarity of phylogenetic trees as indicator of protein-protein interaction. Protein Eng 2001, 14(9):609–614. 10.1093/protein/14.9.609
- Pellegrini M, Marcotte EM, Thompson MJ, Eisenberg D, Yeates TO: Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc Natl Acad Sci U S A 1999, 96(8):4285–4288. 10.1073/pnas.96.8.4285
- Enright AJ, Iliopoulos I, Kyrpides NC, Ouzounis CA: Protein interaction maps for complete genomes based on gene fusion events. Nature 1999, 402(6757):86–90. 10.1038/47056
- Marcotte EM, Pellegrini M, Ng HL, Rice DW, Yeates TO, Eisenberg D: Detecting protein function and protein-protein interactions from genome sequences. Science 1999, 285(5428):751–753. 10.1126/science.285.5428.751
- Walhout AJ, Sordella R, Lu X, Hartley JL, Temple GF, Brasch MA, Thierry-Mieg N, Vidal M: Protein interaction mapping in C. elegans using proteins involved in vulval development. Science 2000, 287(5450):116–122. 10.1126/science.287.5450.116
- Matthews LR, Vaglio P, Reboul J, Ge H, Davis BP, Garrels J, Vincent S, Vidal M: Identification of potential interaction networks using sequence-based searches for conserved protein-protein interactions or "interologs". Genome Res 2001, 11(12):2120–2126. 10.1101/gr.205301
- Yu H, Luscombe NM, Lu HX, Zhu X, Xia Y, Han JD, Bertin N, Chung S, Vidal M, Gerstein M: Annotation transfer between genomes: protein-protein interologs and protein-DNA regulogs. Genome Res 2004, 14(6):1107–1118. 10.1101/gr.1774904
- Lehner B, Fraser AG: A first-draft human protein-interaction map. Genome Biol 2004, 5(9):R63. 10.1186/gb-2004-5-9-r63
- Jansen R, Yu H, Greenbaum D, Kluger Y, Krogan NJ, Chung S, Emili A, Snyder M, Greenblatt JF, Gerstein M: A Bayesian networks approach for predicting protein-protein interactions from genomic data. Science 2003, 302(5644):449–453. 10.1126/science.1087361
- Deane CM, Salwinski L, Xenarios I, Eisenberg D: Protein interactions: two methods for assessment of the reliability of high throughput observations. Mol Cell Proteomics 2002, 1(5):349–356. 10.1074/mcp.M100037-MCP200
- Gandhi TK, Zhong J, Mathivanan S, Karthick L, Chandrika KN, Mohan SS, Sharma S, Pinkert S, Nagaraju S, Periaswamy B, Mishra G, Nandakumar K, Shen B, Deshpande N, Nayak R, Sarker M, Boeke JD, Parmigiani G, Schultz J, Bader JS, Pandey A: Analysis of the human protein interactome and comparison with yeast, worm and fly interaction datasets. Nat Genet 2006, 38(3):285–293. 10.1038/ng1747
- Goldberg DS, Roth FP: Assessing experimentally derived interactions in a small world. Proc Natl Acad Sci U S A 2003, 100(8):4372–4376. 10.1073/pnas.0735871100
- Yu H, Paccanaro A, Trifonov V, Gerstein M: Predicting interactions in protein networks by completing defective cliques. Bioinformatics 2006, 22(7):823–829. 10.1093/bioinformatics/btl014
- Ben-Hur A, Noble WS: Kernel methods for predicting protein-protein interactions. Bioinformatics 2005, 21 Suppl 1: i38–46. 10.1093/bioinformatics/bti1016
- Lu LJ, Xia Y, Paccanaro A, Yu H, Gerstein M: Assessing the limits of genomic data integration for predicting protein networks. Genome Res 2005, 15(7):945–953. 10.1101/gr.3610305
- Jaimovich A, Elidan G, Margalit H, Friedman N: Towards an integrated protein-protein interaction network: a relational Markov network approach. J Comput Biol 2006, 13(2):145–164. 10.1089/cmb.2006.13.145
- Myers CL, Robson D, Wible A, Hibbs MA, Chiriac C, Theesfeld CL, Dolinski K, Troyanskaya OG: Discovery of biological networks from diverse functional genomic data. Genome Biol 2005, 6(13):R114. 10.1186/gb-2005-6-13-r114
- Rhodes DR, Tomlins SA, Varambally S, Mahavisno V, Barrette T, Kalyana-Sundaram S, Ghosh D, Pandey A, Chinnaiyan AM: Probabilistic model of the human protein-protein interaction network. Nat Biotechnol 2005, 23(8):951–959. 10.1038/nbt1103
- Qi Y, Bar-Joseph Z, Klein-Seetharaman J: Evaluation of different biological data and computational classification methods for use in protein interaction prediction. Proteins 2006, 63(3):490–500. 10.1002/prot.20865
- Kemmer D, Huang Y, Shah SP, Lim J, Brumm J, Yuen MM, Ling J, Xu T, Wasserman WW, Ouellette BF: Ulysses - an application for the projection of molecular interactions across species. Genome Biol 2005, 6(12):R106. 10.1186/gb-2005-6-12-r106
- Mika S, Rost B: Protein-protein interactions more conserved within species than across species. PLoS Comput Biol 2006, 2(7):e79. 10.1371/journal.pcbi.0020079
- Jonsson PF, Cavanna T, Zicha D, Bates PA: Cluster analysis of networks generated through homology: automatic identification of important protein communities involved in cancer metastasis. BMC Bioinformatics 2006, 7: 2. 10.1186/1471-2105-7-2
- Lim J, Hao T, Shaw C, Patel AJ, Szabo G, Rual JF, Fisk CJ, Li N, Smolyar A, Hill DE, Barabasi AL, Vidal M, Zoghbi HY: A protein-protein interaction network for human inherited ataxias and disorders of Purkinje cell degeneration. Cell 2006, 125(4):801–814. 10.1016/j.cell.2006.03.032
- Ben-Hur A, Noble WS: Choosing negative examples for the prediction of protein-protein interactions. BMC Bioinformatics 2006, 7 Suppl 1: S2. 10.1186/1471-2105-7-S1-S2
- Myers CL, Barrett DR, Hibbs MA, Huttenhower C, Troyanskaya OG: Finding function: evaluation methods for functional genomic data. BMC Genomics 2006, 7: 187. 10.1186/1471-2164-7-187
- Scott MS, Thomas DY, Hallett MT: Predicting subcellular localization via protein motif co-occurrence. Genome Res 2004, 14(10A):1957–1966. 10.1101/gr.2650004
- D'Haeseleer P, Church GM: Estimating and improving protein interaction error rates. Proc IEEE Comput Syst Bioinform Conf 2004, 216–223.
- Xia K, Dong D, Han JD: IntNetDB v1.0: an integrated protein-protein interaction network database generated by a probabilistic model. BMC Bioinformatics 2006, 7: 508. 10.1186/1471-2105-7-508
- Lu X, Chen J, Sasmono RT, Hsi ED, Sarosiek KA, Tiganis T, Lossos IS: TCPTP, Distinctively Expressed in ABC-Like Diffuse Large B-Cell Lymphomas, is the Nuclear Phosphatase of STAT6. Mol Cell Biol 2007, 27(6):2166–2179. 10.1128/MCB.01234-06
- Le Clainche C, Schlaepfer D, Ferrari A, Klingauf M, Grohmanova K, Veligodskiy A, Didry D, Le D, Egile C, Carlier MF, Kroschewski R: IQGAP1 stimulates actin assembly through the N-WASP-Arp2/3 pathway. J Biol Chem 2007, 282(1):426–435. 10.1074/jbc.M607711200
- Xu Y, Wong SH, Tang BL, Subramaniam VN, Zhang T, Hong W: A 29-kilodalton Golgi soluble N-ethylmaleimide-sensitive factor attachment protein receptor (Vti1-rp2) implicated in protein trafficking in the secretory pathway. J Biol Chem 1998, 273(34):21783–21789. 10.1074/jbc.273.34.21783
- Galli T, Zahraoui A, Vaidyanathan VV, Raposo G, Tian JM, Karin M, Niemann H, Louvard D: A novel tetanus neurotoxin-insensitive vesicle-associated membrane protein in SNARE complexes of the apical plasma membrane of epithelial cells. Mol Biol Cell 1998, 9(6):1437–1448.
- Mallard F, Tang BL, Galli T, Tenza D, Saint-Pol A, Yue X, Antony C, Hong W, Goud B, Johannes L: Early/recycling endosomes-to-TGN transport involves two SNARE complexes and a Rab6 isoform. J Cell Biol 2002, 156(4):653–664. 10.1083/jcb.200110081PubMed CentralView ArticlePubMedGoogle Scholar
- Komamura-Kohno Y, Karasawa-Shimizu K, Saitoh T, Sato M, Hanaoka F, Tanaka S, Ishimi Y: Site-specific phosphorylation of MCM4 during the cell cycle in mammalian cells. Febs J 2006, 273(6):1224–1239. 10.1111/j.1742-4658.2006.05146.xView ArticlePubMedGoogle Scholar
- Colland F, Jacq X, Trouplin V, Mougin C, Groizeleau C, Hamburger A, Meil A, Wojcik J, Legrain P, Gauthier JM: Functional proteomics mapping of a human signaling pathway. Genome Res 2004, 14(7):1324–1332. 10.1101/gr.2334104PubMed CentralView ArticlePubMedGoogle Scholar
- Kersey PJ, Duarte J, Williams A, Karavidopoulou Y, Birney E, Apweiler R: The International Protein Index: an integrated database for proteomics experiments. Proteomics 2004, 4(7):1985–1988. 10.1002/pmic.200300721
- Salwinski L, Miller CS, Smith AJ, Pettit FK, Bowie JU, Eisenberg D: The Database of Interacting Proteins: 2004 update. Nucleic Acids Res 2004, 32(Database issue):D449–51. 10.1093/nar/gkh086
- Alfarano C, Andrade CE, Anthony K, Bahroos N, Bajec M, Bantoft K, Betel D, Bobechko B, Boutilier K, Burgess E, Buzadzija K, Cavero R, D'Abreo C, Donaldson I, Dorairajoo D, Dumontier MJ, Dumontier MR, Earles V, Farrall R, Feldman H, Garderman E, Gong Y, Gonzaga R, Grytsan V, Gryz E, Gu V, Haldorsen E, Halupa A, Haw R, Hrvojic A, Hurrell L, Isserlin R, Jack F, Juma F, Khan A, Kon T, Konopinsky S, Le V, Lee E, Ling S, Magidin M, Moniakis J, Montojo J, Moore S, Muskat B, Ng I, Paraiso JP, Parker B, Pintilie G, Pirone R, Salama JJ, Sgro S, Shan T, Shu Y, Siew J, Skinner D, Snyder K, Stasiuk R, Strumpf D, Tuekam B, Tao S, Wang Z, White M, Willis R, Wolting C, Wong S, Wrong A, Xin C, Yao R, Yates B, Zhang S, Zheng K, Pawson T, Ouellette BF, Hogue CW: The Biomolecular Interaction Network Database and related tools 2005 update. Nucleic Acids Res 2005, 33(Database issue):D418–24. 10.1093/nar/gki051
- Brown KR, Jurisica I: Online predicted human interaction database. Bioinformatics 2005, 21(9):2076–2082. 10.1093/bioinformatics/bti273
- Breitkreutz BJ, Stark C, Tyers M: The GRID: the General Repository for Interaction Datasets. Genome Biol 2003, 4(3):R23. 10.1186/gb-2003-4-3-r23
- Barrett T, Suzek TO, Troup DB, Wilhite SE, Ngau WC, Ledoux P, Rudnev D, Lash AE, Fujibuchi W, Edgar R: NCBI GEO: mining millions of expression profiles--database and tools. Nucleic Acids Res 2005, 33(Database issue):D562–6. 10.1093/nar/gki022
- Su AI, Wiltshire T, Batalov S, Lapp H, Ching KA, Block D, Zhang J, Soden R, Hayakawa M, Kreiman G, Cooke MP, Walker JR, Hogenesch JB: A gene atlas of the mouse and human protein-encoding transcriptomes. Proc Natl Acad Sci U S A 2004, 101(16):6062–6067. 10.1073/pnas.0400782101
- O'Brien KP, Remm M, Sonnhammer EL: Inparanoid: a comprehensive database of eukaryotic orthologs. Nucleic Acids Res 2005, 33(Database issue):D476–80. 10.1093/nar/gki107
- Mulder NJ, Apweiler R, Attwood TK, Bairoch A, Bateman A, Binns D, Bradley P, Bork P, Bucher P, Cerutti L, Copley R, Courcelle E, Das U, Durbin R, Fleischmann W, Gough J, Haft D, Harte N, Hulo N, Kahn D, Kanapin A, Krestyaninova M, Lonsdale D, Lopez R, Letunic I, Madera M, Maslen J, McDowall J, Mitchell A, Nikolskaya AN, Orchard S, Pagni M, Ponting CP, Quevillon E, Selengut J, Sigrist CJ, Silventoinen V, Studholme DJ, Vaughan R, Wu CH: InterPro, progress and status in 2005. Nucleic Acids Res 2005, 33(Database issue):D201–5. 10.1093/nar/gki106
- Finn RD, Mistry J, Schuster-Bockler B, Griffiths-Jones S, Hollich V, Lassmann T, Moxon S, Marshall M, Khanna A, Durbin R, Eddy SR, Sonnhammer EL, Bateman A: Pfam: clans, web tools and services. Nucleic Acids Res 2006, 34(Database issue):D247–51. 10.1093/nar/gkj149
- Jefferson ER, Walsh T, Roberts T, Barton GJ: SNAPPI-DB: A Database and API of Structures, iNterfaces and Alignments for Protein-Protein Interactions. Nucleic Acids Res 2007.
- Wu CH, Apweiler R, Bairoch A, Natale DA, Barker WC, Boeckmann B, Ferro S, Gasteiger E, Huang H, Lopez R, Magrane M, Martin MJ, Mazumder R, O'Donovan C, Redaschi N, Suzek B: The Universal Protein Resource (UniProt): an expanding universe of protein information. Nucleic Acids Res 2006, 34(Database issue):D187–91. 10.1093/nar/gkj161
- Singh GP, Ganapathi M, Dash D: Role of intrinsic disorder in transient interactions of hub proteins. Proteins 2006, 66(4):761–765. 10.1002/prot.21281
- Peng K, Radivojac P, Vucetic S, Dunker AK, Obradovic Z: Length-dependent prediction of protein intrinsic disorder. BMC Bioinformatics 2006, 7: 208. 10.1186/1471-2105-7-208
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.