- Methodology article
- Open Access
Inferring high-confidence human protein-protein interactions
© Yu et al.; licensee BioMed Central Ltd. 2012
- Received: 5 October 2011
- Accepted: 4 May 2012
- Published: 4 May 2012
As numerous experimental factors drive the acquisition, identification, and interpretation of protein-protein interactions (PPIs), aggregated assemblies of human PPI data invariably contain experiment-dependent noise. Ascertaining the reliability of PPIs collected from these diverse studies and scoring them to infer high-confidence networks is a non-trivial task. Moreover, a large number of PPIs share the same number of reported occurrences, making it impossible to distinguish the reliability of these PPIs and rank-order them. For example, for the data analyzed here, we found that the majority (>83%) of currently available human PPIs have been reported only once.
In this work, we proposed an unsupervised statistical approach to score a set of diverse, experimentally identified PPIs from nine primary databases to create subsets of high-confidence human PPI networks. We evaluated this ranking method by comparing it with other methods and assessing their ability to retrieve protein associations from a number of diverse and independent reference sets. These reference sets contain known biological data that are either directly or indirectly linked to interactions between proteins. We quantified the average effect of using ranked protein interaction data to retrieve this information and showed that, when compared to randomly ranked interaction data sets, the proposed method created a larger enrichment (~134%) than either ranking based on the hypergeometric test (~109%) or occurrence ranking (~46%).
From our evaluations, it was clear that ranked interactions were always of value because higher-ranked PPIs had a higher likelihood of retrieving high-confidence experimental data. Reducing the noise inherent in aggregated experimental PPIs via our ranking scheme further increased the accuracy and enrichment of PPIs derived from a number of biologically relevant data sets. These results suggest that using our high-confidence protein interactions at different levels of confidence will help clarify the topological and biological properties associated with human protein networks.
The development of high-throughput techniques during the last decade has led to an unprecedented increase in the volume of identified human protein-protein interactions (PPIs). The currently available individual PPI data sets can be roughly categorized into three sets: 1) proteome-wide, large-scale screenings aimed at investigating all possible PPIs [1–3], 2) semi-large-scale screenings aimed at investigating the interactions between a specific group of proteins (typically in a pathway) and all other proteins [4, 5], and 3) small-scale, traditional studies aimed at detecting specific PPIs among biologically interesting proteins, e.g., oncogenes and their regulators. Although this latter set is still numerically dominant (~80% of all PPIs belong to this set), examples of the first two types of investigations are expanding rapidly.
Given this extensive resource of known human PPIs and their continuous accelerated growth, how to globally analyze and aggregate the data remains a challenge. Statistical methods for inferring confidence of protein interactions can be broadly divided into two groups [6–8]: scoring schemes that rely on the interaction data themselves (e.g., affinity purification/mass-spectrometry [AP/MS] data or yeast two-hybrid [Y2H] data) and scoring schemes that require additional data sources not directly related to the interactions per se (e.g., functional annotation or gene expression data). Herein, we address the question of how to extract high-confidence PPIs while relying only on the aggregated interaction data themselves.
The most intuitive approach to infer high-confidence PPIs is to score PPIs based on the number of times an interaction has been reported [9–11]. However, the number of times a PPI has been reported (occurrence) across different studies can be influenced by numerous unknowable experimental factors; for example, recent studies have demonstrated that such factors may decrease the reliability of PPIs containing frequently studied proteins. Moreover, a large number of PPIs share the same number of reported occurrences, making it impossible to use occurrence alone to establish the reliability of these PPIs and rank-order them. For example, for the data analyzed here, we found that the majority (>83%) of currently available human PPIs have been reported only once.
Herein, we propose an unsupervised statistical approach to score and rank a set of diverse, experimentally identified PPIs. We applied this methodology to human PPIs (non-physical associations excluded) aggregated from nine publicly available primary databases that exclusively contain experimental data (Additional file 1): the Biomolecular Interaction Network Database (BIND), the Biological General Repository for Interaction Datasets (BioGRID), the Database of Interacting Proteins (DIP), the Human Protein Reference Database (HPRD), IntAct, the Molecular INTeraction database (MINT), the mammalian PPI database of the Munich Information Center on Protein Sequences (MIPS), PDZBase (a PPI database for PDZ-domains), and Reactome. Our method re-normalizes the importance of frequently occurring proteins among PPIs to avoid giving added (and potentially artificial) weight to those interactions. We estimated the importance of a PPI by comparing the actual observed occurrence of a PPI with its occurrence in a randomized sample. This calculation gauges the likelihood that the interaction occurs by chance in the set of all observed PPIs. Using these estimates, we rank-ordered the aggregated input PPI data set, allowing us to create high-confidence subsets based on a given rank threshold. At the lowest ranked threshold, all interactions are included and there is no difference between the ranked data and the original set of PPIs.
The presented scoring and ranking procedure can be seen as an extension of our previous effort to infer high-confidence interactions from the affinity purification raw data, termed interaction detection based on shuffling (IDBOS), and, in the following, we will also refer to our scoring and ranking scheme as IDBOS. Our proposed procedure shares similarities with estimating probabilities of observed interactions above a random background based on the hypergeometric distribution, with the distinction that the IDBOS-generated probability density distribution functions correct for biases toward self-interaction among frequently studied proteins. Although other methods exist for assigning confidence scores to PPIs, these generally require additional data or reference sets [24, 25], or a priori assumptions of network topology. To the best of our knowledge, this is the first application of an unsupervised probabilistic scoring and ranking scheme to create subsets of unbiased high-confidence human PPI networks.
We evaluated the improvement in using IDBOS-ranked PPI data by comparing it with other methods and assessing their ability to retrieve biological associations from a number of diverse and independent reference sets. These reference sets contain known biological data that are either directly (e.g., crystallographically determined protein complexes) or indirectly (e.g., co-expressed genes) linked to interactions between proteins. The hypothesis we tested was that sets of highly ranked PPIs are enriched in biological associations as determined from the diverse reference sets. We quantified the average effect of using ranked protein interaction data to retrieve this information and showed that, when compared to randomly ranked interaction data sets, IDBOS created a larger enrichment (~134%) than either ranking based on the hypergeometric test (~109%) or occurrence ranking (~46%).
From our evaluations, it was clear that ranked interactions were always of value because higher-ranked PPIs had a higher likelihood of retrieving biologically relevant data. Statistically removing the biasing factors inherent in aggregated PPI data via the IDBOS-ranking scheme further increased the accuracy and enrichment of biological information associated with PPIs.
Statistical scoring of human PPIs from literature data
We used the collection of human experimental and physical PPIs to create a set of 116,134 reported interactions, containing 80,980 unique physical associations between 13,369 distinct proteins (see Materials and methods). Out of the unique PPIs, 13,554 interactions were observed in more than one experiment. The number of times a PPI has been reported in the literature is an important metric for inferring high-confidence interactions [9, 10, 27], but it could also be dependent on other factors [12, 28]. The observed number of interactions of a protein is partly reflective of how often it has been studied (popularity). For example, the top five connected proteins in the PPI data are G-protein beta subunit (GNB1), G-protein gamma subunit (GNGT1), G-protein alpha subunit (GNAL), ubiquitin C (UBC), and tumor protein p53 (TP53), having 2,280, 2,248, 2,243, 1,899, and 1,097 reported interactions, respectively. To normalize this popularity bias, we compared the observed number of protein interactions with statistics derived from the corresponding probability density distribution functions generated from randomized data. In the generation of the random interaction sets, we kept each protein’s reported number of interactions fixed and, thus, we expect the corresponding interaction probabilities of proteins with a high (low) number of reported interactions also to be high (low).
Note that the reported number of interactions involving a protein refers to the total number of observed PPIs in the literature involving that specific protein. This is different from a protein’s degree, which is defined as the number of unique interacting protein partners. Thus, while TP53 is associated with 1,097 observed PPIs, its degree is reduced to 478 due to the multiple observations of many of the involved interactions.
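The distinction between a protein's reported number of interactions and its degree can be made concrete with a minimal sketch. The function name and the toy records are illustrative assumptions, not part of the original analysis; each record stands for one reported observation of a pair.

```python
from collections import Counter

def interaction_counts_and_degrees(records):
    """Return (reported interaction count, degree) per protein.

    records: list of (protein_a, protein_b), one entry per reported
    observation, so the same pair may appear several times.
    """
    reported = Counter()   # every observation counts
    partners = {}          # unique partners only
    for a, b in records:
        reported[a] += 1
        reported[b] += 1
        partners.setdefault(a, set()).add(b)
        partners.setdefault(b, set()).add(a)
    degree = {protein: len(found) for protein, found in partners.items()}
    return reported, degree

# Toy data: pair A-B is reported three times, pair A-C once.
toy_records = [("A", "B"), ("A", "B"), ("A", "B"), ("A", "C")]
reported, degree = interaction_counts_and_degrees(toy_records)
```

In this toy set, protein A has four reported interactions but a degree of two, mirroring the TP53 example above on a much smaller scale.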
Z-scores and p-values (Z_ij and p_ij) are legitimate metrics for ranking an observed PPI, although they are associated with different numerical properties and uncertainties because of the limitations imposed on using a finite number of randomizations.
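The following is a minimal sketch of how such empirical scores could be estimated, not the authors' implementation. It assumes that randomization pools all interaction endpoints, re-pairs them at random while keeping each protein's reported count fixed, and removes self-pairs by swapping partners; that the Z-score is the observed count minus the mean of the randomized counts divided by their standard deviation; and that the p-value is the fraction of randomizations in which the pair occurs at least as often as observed. The function names, the repair strategy, and the toy data are illustrative assumptions.

```python
import random
from collections import Counter
from statistics import mean, pstdev

def shuffle_records(records, rng, max_tries=1000):
    """One random realization that keeps every protein's record count fixed."""
    stubs = [protein for pair in records for protein in pair]
    rng.shuffle(stubs)
    pairs = [[stubs[k], stubs[k + 1]] for k in range(0, len(stubs), 2)]
    for a in range(len(pairs)):
        tries = 0
        while pairs[a][0] == pairs[a][1]:          # self-pair: swap partners
            tries += 1
            if tries > max_tries:
                raise RuntimeError("could not remove a self-pair")
            b = rng.randrange(len(pairs))
            if pairs[a][0] not in pairs[b]:        # swap cannot create a new self-pair
                pairs[a][1], pairs[b][1] = pairs[b][1], pairs[a][1]
    return pairs

def score_ppis(records, n_random=1000, seed=0):
    """Return {pair: (z_score, p_value)} estimated from n_random shuffles."""
    rng = random.Random(seed)
    observed = Counter(frozenset(pair) for pair in records)
    samples = {pair: [] for pair in observed}
    for _ in range(n_random):
        counts = Counter(frozenset(p) for p in shuffle_records(records, rng))
        for pair in observed:
            samples[pair].append(counts.get(pair, 0))
    scores = {}
    for pair, o_ij in observed.items():
        mu, sigma = mean(samples[pair]), pstdev(samples[pair])
        if sigma > 0:
            z = (o_ij - mu) / sigma
        else:
            z = 0.0 if o_ij == mu else float("inf")
        p = sum(c >= o_ij for c in samples[pair]) / n_random
        scores[pair] = (z, p)
    return scores

# Toy data: a hub H seen once with each of ten partners, and a pair A-B seen three times.
toy_records = [("H", f"X{k}") for k in range(10)] + [("A", "B")] * 3
toy_scores = score_ppis(toy_records, n_random=300, seed=1)
```

In this toy set the repeatedly observed A-B pair receives a larger Z-score and a smaller p-value than any single hub interaction. With a finite number of shuffles the smallest non-zero p-value is 1/n_random, which is one reason the two metrics carry different uncertainties.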
Assessing p-values and Z-scores as PPI metrics
The procedure for shuffling the data allowed us to compare the frequency of each observed interaction O_ij to that interaction’s probability density distribution function PDF_ij. This distribution is different for each interaction and depends on the number of distinct proteins and interactions present in the entire network. Lacking an analytical expression to generate the exact distribution functions, the procedure outlined in Equations 1 and 2 allowed us to generate estimates for both Z-scores and p-values for each interacting pair. We found that using only p-values or only Z-scores was inferior to using the combined information (data not shown); hence, we aggregated these two metrics by converting the value of each metric into a rank and generating the average rank from both metrics for each interaction. Interactions with the same p-value (or Z-score) were assigned the same rank. If interactions had the same rank in the final averaged-rank list, these interactions were considered equivalent and analyzed together. The ranked data are provided in the Supplementary materials (Additional file 3). Instead of assigning a posteriori probabilities to already observed interactions, we only used ranks and comparisons between ranked data to gauge the biological information contained in these subsets.
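The rank aggregation described above can be sketched as follows. This is a minimal illustration under stated assumptions: ties receive fractional ranks (the mean of the ordinal ranks they span, as described in Materials and methods), larger Z-scores and smaller p-values rank first, and the function names and toy values are hypothetical.

```python
def fractional_ranks(values, reverse=False):
    """Rank values (1 = best); tied values share the mean of their ordinal ranks."""
    order = sorted(range(len(values)), key=lambda k: values[k], reverse=reverse)
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1   # mean of ordinal ranks start+1 .. end+1
        for k in order[start:end + 1]:
            ranks[k] = shared
        start = end + 1
    return ranks

def combined_rank(z_scores, p_values):
    """Average the Z-score rank and the p-value rank of each interaction."""
    z_rank = fractional_ranks(z_scores, reverse=True)  # larger Z ranks first
    p_rank = fractional_ranks(p_values)                # smaller p ranks first
    return [(a + b) / 2 for a, b in zip(z_rank, p_rank)]

# Toy values for four interactions.
toy_combined = combined_rank([5.0, 2.0, 2.0, 0.5], [0.0, 0.0, 0.1, 0.4])
```

Here the second and third interactions tie on Z-score but are separated by their p-values, which illustrates how combining the two metrics can resolve ties that either metric alone leaves in place.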
Top-ranked IDBOS PPIs are different from the most frequently reported PPIs
Top 20 protein-protein interactions from IDBOS ranking
Inducible T cell co-stimulator
Inducible T cell co-stimulator ligand
Non-SMC condensin I, G
Non-SMC condensin I, H
Inducible T cell co-stimulator
Inducible T cell co-stimulator ligand
Pyroglutamylated RFamide peptide
Branched chain keto acid dehydro. E1, alpha
Branched chain keto acid dehydro. E1, beta
Artemin (GDNF family)
GDNF family receptor alpha 3
Chemokine (C-X3-C motif) ligand 1
Chemokine (C-X3-C motif) receptor 1
Gastric inhibitory polypeptide
Gastric inhibitory polypeptide receptor
Polymerase (DNA directed), gamma
Polymerase (DNA directed), gamma2
tRNA (guanine-N(7)-)-methyltran. subunit WDR4
Interleukin 11 receptor, alpha
Hemoglobin, alpha 2
C-type lectin domain family 2, member D
Killer cell lectin-like receptor subfamily B, 1
Leukotriene C4 synthase
Microsomal glutathione S-transferase 1
Leukocyte antigen CD97
Dynein, cytoplasmic 2, light intermed. chain 1
Dynein, cytoplasmic 2, heavy chain 1
Interleukin 22 receptor, alpha 1
MRC OX-2 antigen
CD200 receptor 1
Top 20 protein-protein interactions from occurrence ranking
Mouse double minute 2 homolog
Tumor protein p53
Tumor protein p53
Hemoglobin, alpha 2
Cas-Br-M ecotropic sequence
Epidermal growth factor receptor
Cas-Br-M ecotropic sequence
Growth factor receptor bound 2
Fanconi anemia, complementation A
Fanconi anemia, complementation G
Epidermal growth factor receptor
Breast cancer 2, early onset
DNA repair protein RAD51 homolog 1
Hypoxia inducible factor 1, alpha
von Hippel-Lindau tumor suppressor
Synaptosomal-associated protein, 25kDa
MYC associated factor X
BRCA1 assoc. RING domain 1
Breast cancer 1, early onset
Growth factor receptor bound 2
SHC transforming protein 1
Cadherin 1, type 1
Catenin, beta 1
E2F transcription factor 1
Growth factor receptor bound 2
Son of sevenless homolog 1
Cyclin-dependent kinase 2
Epidermal growth factor
Epidermal growth factor receptor
Nucl. factor of kappa light chain gene enhancer in B-cells
V-rel reticulo-endotheliosis viral oncogene homolog A
Not only were the top-ranked sets different, but the remaining bulk of the interactions also showed considerable changes in rank when the IDBOS p-values were non-zero (see Additional files 3 and 4). To evaluate the effect of this re-ranking of interactions, we then asked whether these rankings have any impact on retrieving biological information. We addressed this question by comparing the ability of the two schemes to identify known and inferred biological relationships based on rank. The hypothesis we tested was that higher-ranked subsets of the data sets are better at retrieving biological information, and that IDBOS ranking provides a more efficient way of retrieving this information than ranking solely based on frequency of occurrence.
Evaluation of ranking schemes as measures of identifying interacting proteins
To assess different PPI ranking schemes, we constructed six benchmark reference sets derived from high-quality experimental studies as detailed in Materials and methods. These independent data sets comprise PPIs detected using 1) far-Western blotting, 2) isothermal titration calorimetry, 3) nuclear magnetic resonance, 4) surface plasmon resonance, 5) direct interactions from protein complex structures from the PDB, and 6) homologous human PPIs derived from actual mouse PPI data. These benchmark sets tested the ability of the ranked data to retrieve known interactions derived from a variety of experimentally determined PPI data sets.
Using these six benchmark reference sets, we compared the IDBOS-ranked data set with a set ranked by PPI frequency of occurrence, a set ranked using the hypergeometric test (see Materials and methods, Additional file 5), and two random data sets. The random data sets included a set in which the PPIs themselves were retained but their ranks were assigned randomly, and a set in which the interactions themselves were randomized and then assigned random ranks. In addition, we included the results obtained by directly using the nine individual data sources that formed the aggregated collection of human PPIs. The objective here was to assess each ranking method’s capability to retrieve true interactions in each of the reference sets, as a function of rank, i.e., higher-ranked data should contain a larger fraction of true interactions than lower-ranked data.
To make a fair comparison within each benchmark reference set, for each set we defined a protein pair from the aggregated data set as a judgeable interaction if both proteins appeared in the reference set; otherwise, we assumed that the pair could not be judged by the reference set. If a judgeable interaction was observed in the reference set, we termed it a true interaction. A good scheme would rank true interactions before other judgeable interactions. To quantify the ability to retrieve true interactions in each scored set, we extracted the judgeable subset and, using each corresponding rank as a threshold, we counted the numbers of judgeable and true interactions with scores above the threshold. We defined the number of true interactions as coverage and the fraction of true interactions among those judgeable interactions as accuracy.
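The definitions of judgeable interactions, true interactions, coverage, and accuracy can be sketched as below. This is a simplified illustration: it walks a best-first list one interaction at a time and ignores tied ranks, whereas equivalently ranked interactions were analyzed together in the actual evaluation. The function name and toy data are assumptions.

```python
def coverage_and_accuracy(ranked_pairs, reference_pairs):
    """Return [(n_judgeable, coverage, accuracy)] at each judgeable rank threshold.

    ranked_pairs: list of (protein_a, protein_b), best-ranked first.
    reference_pairs: iterable of (protein_a, protein_b) in the reference set.
    """
    reference = {frozenset(pair) for pair in reference_pairs}
    reference_proteins = {protein for pair in reference for protein in pair}
    curve, judgeable, true = [], 0, 0
    for a, b in ranked_pairs:
        if a not in reference_proteins or b not in reference_proteins:
            continue                      # pair cannot be judged by this reference set
        judgeable += 1
        if frozenset((a, b)) in reference:
            true += 1                     # judgeable pair observed in the reference set
        curve.append((judgeable, true, true / judgeable))
    return curve

# Toy ranked list and reference set.
toy_ranked = [("A", "B"), ("A", "Z"), ("B", "C"), ("A", "C"), ("C", "D")]
toy_reference = [("A", "B"), ("C", "D"), ("B", "C")]
toy_curve = coverage_and_accuracy(toy_ranked, toy_reference)
```

In the toy example the pair A-Z is skipped because Z is absent from the reference set, so it neither helps nor penalizes the ranking scheme.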
Evaluation of different ranking schemes (table rows: isothermal titration calorimetry, nuclear magnetic resonance, surface plasmon resonance, PDB complex PPIs, mouse homologous PPIs)
The independent scores (Z-scores or p-values) were typically better than, or equivalent to, the hypergeometric method: Z-score ranking was close to the IDBOS, and better than the hypergeometric, whereas the p-value ranking was not as good as the hypergeometric. For example, using the PDB benchmarking set, the gains of the IDBOS, Z-score, hypergeometric, and p-value methods relative to the average accuracy of the randomly ranked data were 258%, 253%, 217%, and 216%, respectively; using the mouse homologous benchmarking set, the gains were 222%, 211%, 198%, and 155%, respectively. Thus, the IDBOS combined rankings consistently outperformed all attempted scoring schemes.
IDBOS versus hypergeometric test rankings
The observed differences in Table 3 and Figure 2 between the IDBOS and hypergeometric test rankings warrant further comment. The fundamental difference between the IDBOS and hypergeometric test rankings lies in how they account for interactions between self-interacting proteins. For the IDBOS randomized data, each pair of proteins was re-assigned an interaction partner such that no self-interacting protein pairs remained in the final set used to construct the probability density distribution functions. In the hypergeometric test, by contrast, self-interacting protein pairs were assigned finite probabilities of occurrence based on a background distribution for each protein pair. Conceptually, the sum of the probabilities of observing all interactions of a given protein A with itself and with all other proteins (p_AA, p_AB, p_AC, etc.) was the same in both schemes. However, the constraint that p_AA was zero in IDBOS but non-zero in the hypergeometric test re-distributed the probabilities such that any interaction probability p_AB between protein A and another protein B could differ between the two schemes. This strongly influenced the probability of detecting proteins that occurred with high frequency in the data set.
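A toy calculation may help illustrate the re-distribution argument. It is neither the IDBOS procedure nor the hypergeometric test: it simply assumes that the partner of one interaction endpoint of protein A is drawn in proportion to the remaining endpoints of each protein, with self-pairing either allowed or forbidden. The function name and the counts are hypothetical.

```python
def partner_probabilities(counts, protein, allow_self):
    """Probability that one endpoint of `protein` is paired with each protein.

    counts: {protein: number of interaction endpoints}.
    With allow_self, the protein's own remaining endpoints are eligible;
    otherwise their weight is set to zero and the rest is renormalized.
    """
    weights = {}
    for other, n in counts.items():
        if other == protein:
            weights[other] = (n - 1) if allow_self else 0
        else:
            weights[other] = n
    norm = sum(weights.values())
    return {other: w / norm for other, w in weights.items()}

# Toy counts: A is a frequently observed protein.
toy_counts = {"A": 6, "B": 2, "C": 2}
with_self = partner_probabilities(toy_counts, "A", allow_self=True)
without_self = partner_probabilities(toy_counts, "A", allow_self=False)
```

In this toy case p_AB is 2/9 when self-pairing is allowed and 1/2 when it is forbidden, while both sets of probabilities sum to one. The shift grows with the share of endpoints held by A, consistent with the statement that the effect is strongest for frequently occurring proteins.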
Enrichment of known domain-domain interactions
Domain-domain interaction enrichment
Evaluation of ranking PPIs as a means of retrieving biological information
Up until now, the different scoring schemes were used to evaluate reference data sets that can be considered to be directly related to interacting proteins. We next evaluated the improvement that could be gained by using the differently ranked data sets to rank-order the interactions in reference data sets that are presumed to be enriched with interacting proteins, i.e., in sets of proteins that share the same Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway, are implicated in the same disease, share Gene Ontology (GO) function, or tissue mRNA expression levels.
Case I: Enrichment of KEGG co-pathway gene pairs
Evaluation of ranked interactions to detect biological relationships
Case II: Enrichment of co-disease gene pairs
Genomic variations may underlie different susceptibilities to disease, e.g., individuals with specific single nucleotide polymorphisms (SNPs) can develop, or be pre-disposed to, a particular disease phenotype. Furthermore, genes encoding interacting proteins are more likely to occur within the same disease classification. Using the gene co-disease data extracted by Goh et al. from the Online Mendelian Inheritance in Man (OMIM) dataset, we investigated the enrichment of co-disease gene pairs as a function of rank threshold (Figure 5B, Table 5). A gene pair was termed a co-disease gene pair if their SNPs led to susceptibility to the same disease. With this co-disease gene pair set, we also extracted judgeable protein pairs whose encoding genes appeared in this co-disease set (i.e., genes that have an actual disease annotation). Similar to the previous analyses, the IDBOS-scored set showed an enhanced probability of identifying existing disease-susceptible genes as compared with the frequency of occurrence ranking scheme, hypergeometric-based rankings, or random controls.
Case III: Enrichment of GO co-function gene pairs
Figure 5C shows the ability of the IDBOS and frequency of occurrence ranking schemes to identify PPIs whose proteins share the same gene ontology (GO) function. Both ranking schemes displayed a similar rank-dependent enrichment of co-function proteins compared to the randomly ranked (all ranks have the same importance) and randomly rewired (interactions are random) data sets. Table 5 indicates that the improvements of the ranking schemes compared to randomly ranked data are relatively modest in retrieving functionally related gene pairs. As pointed out by Gillis and Pavlidis, the ability of PPI networks to distinguish co-functionality is largely influenced by the multifunctional nature of the gene annotation schemes themselves. In this case, the IDBOS, hypergeometric, and occurrence ranking schemes themselves will also have less influence on the retrieval of functionally related protein pairs compared to the annotation scheme itself.
Case IV: Enrichment of tissue co-expression gene pairs
Finally, we examined the occurrence of interacting protein pairs in a large data set that maps out the global mRNA expression levels for ~20,000 human genes in 32 normal tissues. For details on the construction of this reference set, see Materials and methods. The biological hypothesis tested here was that co-expression of two mRNAs in the same tissue indicates a higher probability that the corresponding proteins interact. Figure 5D shows the fraction of PPIs retrieved as a function of ranking scheme. Overall, the fraction of PPIs contained in the co-expression data set was lower than that of the other biological reference sets examined above. However, the investigated ranking schemes were able to produce ranked data sets that contain a slightly larger fraction of co-expressed PPIs than using no ranking (Table 5).
Topological differences of ranked PPI networks
Selecting highly ranked subsets of PPIs, using either IDBOS or occurrence, had the effect of reducing the number of high-degree proteins (hubs) in the overall network. This effect was most pronounced in the top-ranked IDBOS networks, as IDBOS effectively down-weights interactions of popular proteins. The resulting top-ranked IDBOS networks were not dominated by hub interactions, but showed a more distributed interaction network, with distinct topological and scaling properties.
Single experiments designed for large-scale detection of PPIs judge multiple observations of the same PPI as evidence for the confidence of the interaction. Multiple observations, across many different experiments, of a particular PPI also lend confidence that this interaction occurs in nature. One would then conclude that if one aggregated all known PPI experiments, the more times a particular PPI occurred, the more confident one would be of the interaction. We found a somewhat counterintuitive result in that ranking PPIs in order of the number of times a PPI has been observed was actually not the optimal way of assessing the importance of frequently reported interactions. Instead, we used the IDBOS method, which ranks interacting protein pairs in the observed PPI data sets as compared to random occurrences derived from an empirically reconstructed probability density distribution function specific to each interaction. We then used these ranks to order the PPI data and showed that this approach consistently identified more known protein associations in numerous benchmark reference sets than using either hypergeometric test rankings or simply frequency of occurrence. The improvement of IDBOS over the hypergeometric test results stems from the differences in the underlying random distributions in the two schemes. While the hypergeometric test does not impose any limitations on self-interacting proteins, by construction, the IDBOS procedure sets their probability of occurrence to zero even for the random case. The input data contain no self-interacting proteins; therefore, the corresponding random probability density distribution functions generated by IDBOS seem to be the most appropriate to gauge unbiased PPIs in this data set.
We evaluated the improvement in using ranked PPI data to identify known protein relationships in eleven independent reference sets, including direct and indirect readouts of protein interactions. We used a large number of different reference sets to perform as comprehensive an analysis as possible, as each reference set only covers a specific subset of protein interactions. It was clear that ranking interactions was of value, because higher-ranked PPIs had better success at retrieving protein associations. Across all sets investigated, occurrence ranking improved the performance by 46% compared to using non-ranked data, hypergeometric ranking achieved a 109% increase, and the IDBOS ranking scheme achieved a 134% increase. This conformed to the assumption that aggregating and creating high-confidence subsets of data creates added value, albeit at a cost of reducing the overall number of PPIs that can be considered.
We have developed a statistical approach to infer subsets of high-confidence human PPIs, and showed that ranked data can consistently enrich the accuracy of the retrieved PPI data in these sets. Our IDBOS method was more successful in ranking interactions than using either the number of times an interaction has been observed across experiments or rankings based on the hypergeometric test. Furthermore, using either IDBOS or hypergeometric scoring schemes generates unique ranks for almost all interactions, as opposed to the frequency of occurrence method, in which many interactions have the same integer score corresponding to the number of observed occurrences. The IDBOS ranking increased accuracy and enrichment of protein interaction data associated with PPIs by more than two-fold compared to simply ranking interactions based on observed occurrences. We achieved this improvement by comparing the observed interaction data with a probability density distribution function that does not inflate the statistical importance of interactions associated with frequently studied proteins. These results suggest that using our high-confidence protein interactions at different levels of confidence could help clarify how the topological and biological properties associated with human protein networks depend on confidence.
Materials and methods
Statistical scoring of PPIs from aggregated experimental data
We downloaded the collection of PPIs in October 2011 from the nine databases that cover the bulk of all known experimentally determined PPIs. Databases of non-primary nature, i.e., containing aggregated data and/or predicted and inferred interactions, were excluded. From this collection, we extracted the subset of human physical PPIs, consisting of PPIs from large-, semi-large-, and small-scale experimental studies. Because Reactome has no standardized annotations describing physical associations or direct interactions, we instead extracted PPIs annotated as “direct complexes.” We treated each interaction reported by each study (identified by a unique publication ID) as a unique record by deleting redundant copies arising from the overlaps among the nine databases. In total, we analyzed 13,369 proteins, 80,980 PPIs, and 116,134 records (Additional file 1). The computational procedure to generate 10^6 random realizations of the PPI data set and compute the corresponding p-values and Z-scores took ~2,000 minutes on a dual core Xeon Irwindale 3.6 GHz 64-bit Linux server equipped with 4 GB of RAM. We used fractional rankings, i.e., PPIs that had the same score received the same ranking number, which is the mean of what they would have received under ordinal rankings. The Supplementary material provides the scored and ranked PPI data set, with equivalently ranked interactions tabulated in arbitrary order.
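The record de-duplication step can be sketched as follows, assuming each database row has been reduced to a protein pair, a publication ID, and a source database. The row layout, function name, and toy rows are illustrative assumptions; identifier mapping between databases is not shown.

```python
def deduplicate_records(rows):
    """Collapse copies of the same study-level report found in several databases.

    rows: iterable of (protein_a, protein_b, publication_id, database).
    Returns (records, unique_ppis, proteins), where a record is an unordered
    protein pair together with the publication that reported it.
    """
    records = {(frozenset((a, b)), pub) for a, b, pub, _database in rows}
    unique_ppis = {pair for pair, _pub in records}
    proteins = {protein for pair in unique_ppis for protein in pair}
    return records, unique_ppis, proteins

# Toy rows: the same report of P1-P2 appears in two databases.
toy_rows = [
    ("P1", "P2", "pub1", "BioGRID"),
    ("P2", "P1", "pub1", "IntAct"),
    ("P1", "P2", "pub2", "HPRD"),
    ("P1", "P3", "pub3", "MINT"),
]
toy_records, toy_ppis, toy_proteins = deduplicate_records(toy_rows)
```

The toy rows collapse to three records, two unique PPIs, and three proteins, the same three quantities reported for the full data set above.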
where N_i (N_j) is the number of times protein i (j) was observed in the aggregated data set, and N is the total number of interactions in the aggregated data set of PPIs. This formulation is correct in the limit of N >> N_i N_j, which was satisfied in this data set. Finally, we ranked the PPI data based on the calculated probabilities.
We mapped all PDB sequences (downloaded in October 2010) to the human protein sequences stored in the UniProtKB/Swiss-Prot database (downloaded in June 2008) and obtained 649 protein complexes that contained at least two different human proteins. We then calculated the contact area of each intra-complex human protein pair using the program EMPIRE, to determine which protein pairs interact. For protein pairs occurring in multiple complexes, we selected the pair with the largest contact area as an interacting pair. We collected 281 direct protein interactions between 563 proteins, with contact areas ranging from 0.1 to 183 nm^2, as a reference set for known human PPIs derived from structural data (Additional file 2).
Reference sets from the HIPPIE database
We extracted four reference sets from the Human Integrated Protein-Protein Interaction rEference (HIPPIE) data set (http://cbdm.mdc-berlin.de/tools/hippie/information.php, January 2012), each of which consists of PPIs detected by a top-scored experimental technique (score = 10, specified with European Bioinformatics Institute (EBI) term ID, and number of PPIs >100). The extracted data sets contained PPIs derived from 1) far-Western blotting, 2) isothermal titration calorimetry, 3) nuclear magnetic resonance, and 4) surface plasmon resonance experiments.
Homologous human PPIs from mouse data
We extracted the subset of mouse physical PPIs from the above collection and constructed a homologous human PPI set according to the sequence homology between mouse and human proteins defined by the National Center for Biotechnology Information (NCBI) homology mapping scheme (ftp://ftp.ncbi.nih.gov/pub/HomoloGene/, October 2009). We used this homologous human PPI set (containing 3,148 interactions between 5,632 proteins) as a reference data set reflective of direct protein interactions.
Domain annotation and interaction data
We used the Pfam-A families of the Pfam 21.0 database as the source for domain annotation. We also downloaded the DDI set inferred from PDB crystal structures from the iPfam database. This DDI set contains 4,030 interactions between 2,837 Pfam-A domains.
Co-disease susceptibility gene pairs
We used the disease susceptibility gene data of Goh et al., which was constructed by processing OMIM raw data. This set contained 43,249 gene pairs among 3,670 genes in which both genes were susceptible to at least one common disease. We used these pairs to represent the co-disease susceptibility gene pairs.
Co-function gene pairs
Here, p_gh is the hypergeometric p-value of the gene pair (g, h), N_g (N_h) is the number of GO annotations of gene g (h), and T is the total number of unique GO annotations. We used –log(p_gh) as the score and chose the top 1% of all gene pairs as a reference set, resulting in 1,502,420 co-function pairs for this reference set.
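Because the exact expression for the p-value is not reproduced in this text, the following is only a sketch of one common way to compute such a score. It assumes that the p-value is the upper tail of the hypergeometric distribution, that is, the probability that two genes with N_g and N_h annotations drawn from T unique annotations share at least the observed number of annotations. The variable `shared`, both function names, and the use of the natural logarithm are assumptions.

```python
from math import comb, log


def hypergeom_tail(shared, n_g, n_h, total):
    """P(X >= shared) for X ~ Hypergeometric(total, n_g, n_h).

    shared: number of annotations common to both genes (assumed input)
    n_g, n_h: number of annotations of gene g and gene h
    total: total number of unique annotations (T in the text)
    """
    upper = min(n_g, n_h)
    numerator = sum(
        comb(n_g, k) * comb(total - n_g, n_h - k)
        for k in range(shared, upper + 1)
    )
    return numerator / comb(total, n_h)


def co_function_score(shared, n_g, n_h, total):
    """Score defined as -log(p); natural logarithm assumed.

    Note: for extremely small p-values the float division above can
    underflow to 0.0, in which case log() raises ValueError.
    """
    return -log(hypergeom_tail(shared, n_g, n_h, total))


p_example = hypergeom_tail(2, 3, 4, 10)  # equals 70/210
```

Under this assumption, a pair sharing many annotations relative to N_g, N_h, and T receives a small p-value and therefore a high score.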
Co-pathway gene pairs
We downloaded the KEGG pathway data file (hsa_gene_map.tab), which lists the pathways in which each annotated gene participates, from the KEGG Web site (ftp://ftp.genome.jp/pub/kegg/pathway/, June 2010). As with the co-function score described above, we computed the co-pathway score using the hypergeometric distribution. We chose the top 1% of the scored gene pairs as the reference set of significant co-pathway gene pairs (139,841 pairs).
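Selecting the top-scoring fraction of gene pairs can be sketched as below. The function name `top_fraction`, the dictionary input, and the handling of ties and rounding (truncate, but keep at least one pair) are assumptions, since the text does not specify them.

```python
def top_fraction(scored_pairs, fraction=0.01):
    """Return the highest-scoring fraction of pairs, best first.

    scored_pairs: dict mapping a gene pair to its score (higher is better)
    fraction:     share of pairs to keep, 0.01 for the top 1%
    Ties at the cutoff are broken by sort order (an assumption).
    """
    ranked = sorted(scored_pairs.items(), key=lambda item: item[1], reverse=True)
    keep = max(1, int(len(ranked) * fraction))
    return [pair for pair, _ in ranked[:keep]]


scores = {("g%d" % i, "h%d" % i): float(i) for i in range(8)}
best_quarter = top_fraction(scores, fraction=0.25)
```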
Tissue co-expression of gene pairs
We used a global expression data set containing mRNA expression levels for ~20,000 human genes in 32 normal tissues derived from massively parallel signature sequencing. The raw data set documents the abundance values of each short sequence signature tag (with a total of 182,727 tags) in 32 tissues. We mapped these tags to the regions of the human genome (hg18) that encode genes, in both orientations. We obtained 105,512 hits in the gene regions in both orientations and, among them, 68,855 hits in the gene orientation (p-value < 10^−2,000), indicating that the tags were able to distinguish the transcribed orientation from the non-transcribed orientation in the genome. Assigning the tags that hit a gene region and orientation to the corresponding gene, we obtained a set of tags for each gene, resulting in a total of 14,516 genes having non-empty tag sets. We summed up the abundance tissue profiles of the tags of a gene to create its raw expression profile.
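The final summation step can be illustrated with a short sketch. It assumes that each tag has a list of abundance values (one per tissue, in a fixed tissue order) and that the tag-to-gene assignment is available as a dictionary; both data layouts and the function name are hypothetical.

```python
def gene_expression_profiles(tag_profiles, tag_to_gene):
    """Sum the per-tissue abundance profiles of all tags assigned to each gene.

    tag_profiles: dict mapping a tag to a list of abundances, one per tissue
    tag_to_gene:  dict mapping a tag to the gene it was assigned to
                  (tags without an assignment are ignored)
    Returns a dict mapping each gene to its summed raw expression profile.
    """
    profiles = {}
    for tag, abundances in tag_profiles.items():
        gene = tag_to_gene.get(tag)
        if gene is None:
            continue  # tag did not hit a gene region in the gene orientation
        if gene not in profiles:
            profiles[gene] = list(abundances)
        else:
            profiles[gene] = [x + y for x, y in zip(profiles[gene], abundances)]
    return profiles


tags = {"t1": [1, 0, 4], "t2": [2, 2, 0], "t3": [5, 5, 5], "t4": [9, 9, 9]}
assignment = {"t1": "GENE_A", "t2": "GENE_A", "t3": "GENE_B"}
raw_profiles = gene_expression_profiles(tags, assignment)
```

Genes with no assigned tags do not appear in the result, which mirrors the restriction to genes with non-empty tag sets.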
We chose the top 1% of these scored gene pairs as the reference set of co-expressed gene pairs to evaluate the PPI scoring approaches. There were 1,053,529 co-expression pairs in this set.
We thank Drs. Jacob Feala, Vesna Memišević, and Xin Hu for helpful discussions. The authors were supported by the Military Operational Medicine Research Program of the U.S. Army Medical Research and Materiel Command, Ft. Detrick, Maryland, as part of the U.S. Army's Network Science Initiative. The opinions and assertions contained herein are the private views of the authors and are not to be construed as official or as reflecting the views of the U.S. Army or the U.S. Department of Defense. This paper has been approved for public release with unlimited distribution.
- Rual JF, Venkatesan K, Hao T, Hirozane-Kishikawa T, Dricot A, Li N, Berriz GF, Gibbons FD, Dreze M, Ayivi-Guedehoussou N, et al.: Towards a proteome-scale map of the human protein-protein interaction network. Nature 2005, 437(7062):1173–1178. 10.1038/nature04209
- Stelzl U, Worm U, Lalowski M, Haenig C, Brembeck FH, Goehler H, Stroedicke M, Zenkner M, Schoenherr A, Koeppen S, et al.: A human protein-protein interaction network: a resource for annotating the proteome. Cell 2005, 122(6):957–968. 10.1016/j.cell.2005.08.029
- Ewing RM, Chu P, Elisma F, Li H, Taylor P, Climie S, McBroom-Cerajewski L, Robinson MD, O'Connor L, Li M, et al.: Large-scale mapping of human protein-protein interactions by mass spectrometry. Mol Syst Biol 2007, 3: 89.
- Jeronimo C, Forget D, Bouchard A, Li Q, Chua G, Poitras C, Therien C, Bergeron D, Bourassa S, Greenblatt J, et al.: Systematic analysis of the protein interaction network for the human transcription machinery reveals the identity of the 7SK capping enzyme. Mol Cell 2007, 27(2):262–274. 10.1016/j.molcel.2007.06.027
- Sowa ME, Bennett EJ, Gygi SP, Harper JW: Defining the human deubiquitinating enzyme interaction landscape. Cell 2009, 138(2):389–403. 10.1016/j.cell.2009.04.042
- Suthram S, Shlomi T, Ruppin E, Sharan R, Ideker T: A direct comparison of protein interaction confidence assignment schemes. BMC Bioinforma 2006, 7: 360. 10.1186/1471-2105-7-360
- Schelhorn SE, Mestre J, Albrecht M, Zotenko E: Inferring physical protein contacts from large-scale purification data of protein complexes. Mol Cell Proteomics 2011, 10(6):M110 004929.
- Yu X, Ivanic J, Memisevic V, Wallqvist A, Reifman J: Categorizing biases in high-confidence high-throughput protein-protein interaction data sets. Mol Cell Proteomics 2011, 11: M111 012500. in press
- Wodak SJ, Pu S, Vlasblom J, Seraphin B: Challenges and rewards of interaction proteomics. Mol Cell Proteomics 2009, 8(1):3–18. 10.1074/mcp.R800014-MCP200
- Yu H, Braun P, Yildirim MA, Lemmens I, Venkatesan K, Sahalie J, Hirozane-Kishikawa T, Gebreab F, Li N, Simonis N, et al.: High-quality binary protein interaction map of the yeast interactome network. Science 2008, 322(5898):104–110. 10.1126/science.1158684
- Hakes L, Pinney JW, Robertson DL, Lovell SC: Protein-protein interaction networks and biology–what's the connection? Nat Biotechnol 2008, 26(1):69–72. 10.1038/nbt0108-69
- Pfeiffer T, Hoffmann R: Large-scale assessment of the effect of popularity on the reliability of research. PLoS One 2009, 4(6):e5996. 10.1371/journal.pone.0005996
- Bader GD, Betel D, Hogue CW: BIND: the Biomolecular Interaction Network Database. Nucleic Acids Res 2003, 31(1):248–250. 10.1093/nar/gkg056
- Stark C, Breitkreutz BJ, Reguly T, Boucher L, Breitkreutz A, Tyers M: BioGRID: a general repository for interaction datasets. Nucleic Acids Res 2006, 34(Database issue):D535–539.
- Salwinski L, Miller CS, Smith AJ, Pettit FK, Bowie JU, Eisenberg D: The Database of Interacting Proteins: 2004 update. Nucleic Acids Res 2004, 32(Database issue):D449–451.
- Peri S, Navarro JD, Amanchy R, Kristiansen TZ, Jonnalagadda CK, Surendranath V, Niranjan V, Muthusamy B, Gandhi TK, Gronborg M, et al.: Development of human protein reference database as an initial platform for approaching systems biology in humans. Genome Res 2003, 13(10):2363–2371. 10.1101/gr.1680803
- Aranda B, Achuthan P, Alam-Faruque Y, Armean I, Bridge A, Derow C, Feuermann M, Ghanbarian AT, Kerrien S, Khadake J, et al.: The IntAct molecular interaction database in 2010. Nucleic Acids Res 2009, 38(Database issue):D525–531.
- Chatr-aryamontri A, Ceol A, Palazzi LM, Nardelli G, Schneider MV, Castagnoli L, Cesareni G: MINT: the Molecular INTeraction database. Nucleic Acids Res 2007, 35(Database issue):D572–574.
- Pagel P, Kovac S, Oesterheld M, Brauner B, Dunger-Kaltenbach I, Frishman G, Montrone C, Mark P, Stumpflen V, Mewes HW, et al.: The MIPS mammalian protein-protein interaction database. Bioinformatics 2005, 21(6):832–834. 10.1093/bioinformatics/bti115
- Beuming T, Skrabanek L, Niv MY, Mukherjee P, Weinstein H: PDZBase: a protein-protein interaction database for PDZ-domains. Bioinformatics 2005, 21(6):827–828. 10.1093/bioinformatics/bti098
- Vastrik I, D'Eustachio P, Schmidt E, Gopinath G, Croft D, de Bono B, Gillespie M, Jassal B, Lewis S, Matthews L, et al.: Reactome: a knowledge base of biologic pathways and processes. Genome Biol 2007, 8(3):R39. 10.1186/gb-2007-8-3-r39
- Yu X, Ivanic J, Wallqvist A, Reifman J: A novel scoring approach for protein co-purification data reveals high interaction specificity. PLoS Comput Biol 2009, 5(9):e1000515. 10.1371/journal.pcbi.1000515
- Hart GT, Lee I, Marcotte EM: A high-accuracy consensus map of yeast protein complexes reveals modular nature of gene essentiality. BMC Bioinforma 2007, 8: 236. 10.1186/1471-2105-8-236
- Deane CM, Salwinski L, Xenarios I, Eisenberg D: Protein interactions: two methods for assessment of the reliability of high throughput observations. Mol Cell Proteomics 2002, 1(5):349–356. 10.1074/mcp.M100037-MCP200
- Deng M, Sun F, Chen T: Assessment of the reliability of protein-protein interactions and protein function prediction. Pacific Symposium on Biocomputing 2003, 8: 140–151.
- Goldberg DS, Roth FP: Assessing experimentally derived interactions in a small world. Proc Natl Acad Sci U S A 2003, 100(8):4372–4376. 10.1073/pnas.0735871100
- Bossi A, Lehner B: Tissue specificity and the human protein interaction network. Mol Syst Biol 2009, 5: 260.
- Gillis J, Pavlidis P: The impact of multifunctional genes on "guilt by association" analysis. PLoS One 2011, 6(2):e17258. 10.1371/journal.pone.0017258
- Wynn RM, Kato M, Machius M, Chuang JL, Li J, Tomchick DR, Chuang DT: Molecular mechanism for regulation of the human mitochondrial branched-chain alpha-ketoacid dehydrogenase complex by phosphorylation. Structure 2004, 12(12):2185–2196. 10.1016/j.str.2004.09.013
- Grissom PM, Vaisberg EA, McIntosh JR: Identification of a novel light intermediate chain (D2LIC) for mammalian cytoplasmic dynein 2. Mol Biol Cell 2002, 13(3):817–829. 10.1091/mbc.01-08-0402
- Mikami A, Tynan SH, Hama T, Luby-Phelps K, Saito T, Crandall JE, Besharse JC, Vallee RB: Molecular structure of cytoplasmic dynein 2 and its distribution in neuronal and ciliated cells. J Cell Sci 2002, 115(Pt 24):4801–4808.
- Cabello OA, Eliseeva E, He WG, Youssoufian H, Plon SE, Brinkley BR, Belmont JW: Cell cycle-dependent expression and nucleolar localization of hCAP-H. Mol Biol Cell 2001, 12(11):3527–3537.
- Wang S, Zhu G, Chapoval AI, Dong H, Tamada K, Ni J, Chen L: Costimulation of T cells by B7-H2, a B7-like molecule that binds ICOS. Blood 2000, 96(8):2808–2813.
- Wang S, Zhu G, Tamada K, Chen L, Bajorath J: Ligand binding sites of inducible costimulator and high avidity mutants with improved function. J Exp Med 2002, 195(8):1033–1041. 10.1084/jem.20011607
- Volz A, Goke R, Lankat-Buttgereit B, Fehmann HC, Bode HP, Goke B: Molecular cloning, functional expression, and signal transduction of the GIP-receptor cloned from a human insulinoma. FEBS Lett 1995, 373(1):23–29. 10.1016/0014-5793(95)01006-Z
- Gallwitz B, Witt M, Morys-Wortmann C, Folsch UR, Schmidt WE: GLP-1/GIP chimeric peptides define the structural requirements for specific ligand-receptor interaction of GLP-1. Regul Pept 1996, 63(1):17–22. 10.1016/0167-0115(96)00019-5
- Manhart S, Hinke SA, McIntosh CH, Pederson RA, Demuth HU: Structure-function analysis of a series of novel GIP analogues containing different helical length linkers. Biochemistry 2003, 42(10):3081–3088. 10.1021/bi026868e
- Yamada Y, Seino Y: Physiology of GIP–a lesson from GIP receptor knockout mice. Horm Metab Res 2004, 36(11–12):771–774.
- Tressel T, Thompson R, Zieske LR, Menendez MI, Davis L: Interaction between L-threonine dehydrogenase and aminoacetone synthetase and mechanism of aminoacetone production. J Biol Chem 1986, 261(35):16428–16437.
- Ta HX, Holm L: Evaluation of different domain-based methods in protein interaction prediction. Biochem Biophys Res Commun 2009, 390(3):357–362. 10.1016/j.bbrc.2009.09.130
- Gupta S, Wallqvist A, Bondugula R, Ivanic J, Reifman J: Unraveling the conundrum of seemingly discordant protein-protein interaction datasets. Conf Proc IEEE Eng Med Biol Soc 2010, 2010: 783–786.
- Finn RD, Marshall M, Bateman A: iPfam: visualization of protein-protein interactions in PDB at domain and amino acid resolutions. Bioinformatics 2005, 21(3):410–412. 10.1093/bioinformatics/bti011
- Kanehisa M, Goto S: KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res 2000, 28(1):27–30. 10.1093/nar/28.1.27
- Zhang X, De la Cruz O, Pinto JM, Nicolae D, Firestein S, Gilad Y: Characterizing the expression of the human olfactory receptor gene family using a novel DNA microarray. Genome Biol 2007, 8(5):R86. 10.1186/gb-2007-8-5-r86
- Goh KI, Cusick ME, Valle D, Childs B, Vidal M, Barabasi AL: The human disease network. Proc Natl Acad Sci U S A 2007, 104(21):8685–8690. 10.1073/pnas.0701361104
- Hamosh A, Scott AF, Amberger JS, Bocchini CA, McKusick VA: Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res 2005, 33(Database issue):D514–517.
- Jongeneel CV, Delorenzi M, Iseli C, Zhou D, Haudenschild CD, Khrebtukova I, Kuznetsov D, Stevenson BJ, Strausberg RL, Simpson AJ, et al.: An atlas of human gene expression from massively parallel signature sequencing (MPSS). Genome Res 2005, 15(7):1007–1014. 10.1101/gr.4041005
- Pierre S, Scholich K: Toponomics: studying protein-protein interactions and protein networks in intact tissue. Mol Biosyst 2010, 6(4):641–647. 10.1039/b910653g
- Ivanic J, Yu X, Wallqvist A, Reifman J: Influence of protein abundance on high-throughput protein-protein interaction detection. PLoS One 2009, 4(6):e5815. 10.1371/journal.pone.0005815
- Liang S, Liu S, Zhang C, Zhou Y: A simple reference state makes a significant improvement in near-native selections from structurally refined docking decoys. Proteins 2007, 69(2):244–253. 10.1002/prot.21498
- Bateman A, Coin L, Durbin R, Finn RD, Hollich V, Griffiths-Jones S, Khanna A, Marshall M, Moxon S, Sonnhammer EL, et al.: The Pfam protein families database. Nucleic Acids Res 2004, 32(Database issue):D138–141.
- Yu X, Lin J, Zack DJ, Qian J: Computational analysis of tissue-specific combinatorial gene regulation: predicting interaction between transcription factors in human tissues. Nucleic Acids Res 2006, 34(17):4925–4936. 10.1093/nar/gkl595
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.