Inferring high-confidence human protein-protein interactions

Background As numerous experimental factors drive the acquisition, identification, and interpretation of protein-protein interactions (PPIs), aggregated assemblies of human PPI data invariably contain experiment-dependent noise. Ascertaining the reliability of PPIs collected from these diverse studies and scoring them to infer high-confidence networks is a non-trivial task. Moreover, a large number of PPIs share the same number of reported occurrences, making it impossible to distinguish the reliability of these PPIs and rank-order them. For example, for the data analyzed here, we found that the majority (>83%) of currently available human PPIs have been reported only once. Results In this work, we proposed an unsupervised statistical approach to score a set of diverse, experimentally identified PPIs from nine primary databases to create subsets of high-confidence human PPI networks. We evaluated this ranking method by comparing it with other methods and assessing their ability to retrieve protein associations from a number of diverse and independent reference sets. These reference sets contain known biological data that are either directly or indirectly linked to interactions between proteins. We quantified the average effect of using ranked protein interaction data to retrieve this information and showed that, when compared to randomly ranked interaction data sets, the proposed method created a larger enrichment (~134%) than either ranking based on the hypergeometric test (~109%) or occurrence ranking (~46%). Conclusions From our evaluations, it was clear that ranked interactions were always of value because higher-ranked PPIs had a higher likelihood of retrieving high-confidence experimental data. Reducing the noise inherent in aggregated experimental PPIs via our ranking scheme further increased the accuracy and enrichment of PPIs derived from a number of biologically relevant data sets. These results suggest that using our high-confidence protein interactions at different levels of confidence will help clarify the topological and biological properties associated with human protein networks.


Background
The development of high-throughput techniques during the last decade has led to an unprecedented increase in the volume of identified human protein-protein interactions (PPIs). The currently available individual PPI data sets can be roughly categorized into three sets: 1) proteome-wide, large-scale screenings aimed at investigating all possible PPIs [1][2][3], 2) semi-large-scale screenings aimed at investigating the interactions between a specific group of proteins (typically in a pathway) and all other proteins [4,5], and 3) small-scale, traditional studies aimed at detecting specific PPIs among biologically interesting proteins, e.g., oncogenes and their regulators. Although this latter set is still numerically dominant (~80% of all PPIs belong to this set), examples of the first two types of investigations are expanding rapidly.
Given this extensive resource of known human PPIs and their continuous accelerated growth, how to globally analyze and aggregate the data remain a challenge. Statistical methods for inferring confidence of protein interactions can be broadly divided into two groups [6][7][8]: scoring schemes that rely on the interaction data themselves (e.g., affinity purification/mass-spectrometry [AP/MS] data or yeast two-hybrid [Y2H] data) and scoring schemes that require additional data sources not directly related to the interactions per se (e.g., functional annotation or gene expression data). Herein, we address the question of how to extract high-confidence PPIs while relying only on the aggregated interaction data themselves.
The most intuitive approach to infer high-confidence PPIs is to score PPIs based on the number of times an interaction has been reported [9][10][11]. However, using the number of times a PPI has been reported (occurrence) across different studies as the metric of reliability could be influenced by numerous unknowable experimental factors, e.g., recent studies have demonstrated that such factors may result in decreased reliability of PPIs containing frequently studied proteins [12]. Moreover, a large number of PPIs share the same number of reported occurrences, making it impossible to use occurrence alone to establish the reliability of these PPIs and rankorder them. For example, for the data analyzed here, we found that the majority (>83%) of currently available human PPIs have been reported only once.
Herein, we propose an unsupervised statistical approach to score and rank a set of diverse, experimentally identified PPIs. We applied this methodology to human PPIs (nonphysical associations excluded) aggregated from nine publicly available primary databases that exclusively contain experimental data (Additional file 1): the Biomolecular Interaction Network Database (BIND) [13], the Biological General Repository for Interaction Datasets (BioGRID) [14], the Database of Interacting Proteins (DIP) [15], the Human Protein Reference Database (HPRD) [16], IntAct [17], the Molecular INTeraction database (MINT) [18], the mammalian PPI database of the Munich Information Center on Protein Sequences (MIPS) [19], PDZBase (a PPI database for PDZ-domains) [20], and Reactome [21]. Our method re-normalizes the importance of frequently occurring proteins among PPIs to avoid giving added (and potentially artificial) weight to those interactions. We estimated the importance of a PPI by comparing the actual observed occurrence of a PPI with its occurrence in a randomized sample. This calculation gauges the likelihood that the interaction occurs by chance in the set of all observed PPIs. Using these estimates, we rank-ordered the aggregated input PPI data set, allowing us to create high-confidence subsets based on a given rank threshold. At the lowest ranked threshold, all interactions are included and there is no difference between the ranked data and the original set of PPIs.
The presented scoring and ranking procedure can be seen as an extension of our previous effort to infer highconfidence interactions from the affinity purification raw data, termed interaction detection based on shuffling (IDBOS) [22], and, in the following, we will also refer to our scoring and ranking scheme as IDBOS. Our proposed procedure shares similarities to estimating probabilities of observed interactions above a random background based on the hypergeometric distribution [23], with the distinction that the IDBOS-generated probability density distribution functions correct for biases toward self-interaction among frequently studied proteins. Although other methods exist for assigning confidence scores to PPIs, these generally require additional data or reference sets [24,25], or a priori assumptions of network topology [26]. To the best of our knowledge, this is the first application of an unsupervised probabilistic scoring and ranking scheme to create subsets of unbiased high-confidence human PPI networks.
We evaluated the improvement in using IDBOSranked PPI data by comparing it with other methods and assessing their ability to retrieve biological associations from a number of diverse and independent reference sets. These reference sets contain known biological data that are either directly (e.g., crystallographically determined protein complexes) or indirectly (e.g., coexpressed genes) linked to interactions between proteins. The hypothesis we tested was that sets of highly ranked PPIs are enriched in biological associations as determined from the diverse reference sets. We quantified the average effect of using ranked protein interaction data to retrieve this information and showed that, when compared to randomly ranked interaction data sets, IDBOS created a larger enrichment (~134%) than either ranking based on the hypergeometric test (~109%) or occurrence ranking (~46%).
From our evaluations, it was clear that ranked interactions were always of value because higher-ranked PPIs had a higher likelihood of retrieving biologically relevant data. Statistically removing the biasing factors inherent in aggregated PPI data via the IDBOS-ranking scheme further increased the accuracy and enrichment of biological information associated with PPIs.

Statistical scoring of human PPIs from literature data
We used the collection of human experimental and physical PPIs to create a set of 116,134 reported interactions, containing 80,980 unique physical associations between 13,369 distinct proteins (see Materials and methods). Out of the unique PPIs, 13,554 interactions were observed in more than one experiment. The number of times a PPI has been reported in the literature is an important metric for inferring high-confidence interactions [9,10,27], but it could also be dependent on other factors [12,28]. The observed number of interactions of a protein is partly reflective of how often it has been studied (popularity). For example, the top five connected proteins in the PPI data are G-protein beta subunit (GNB1), G-protein gamma subunit (GNGT1), G-protein alpha subunit (GNAL), ubiquitin C (UBC), and tumor protein p53 (TP53), having 2,280, 2,248, 2,243, 1,899, and 1,097 reported interactions, respectively. To normalize this popularity bias, we compared the observed number of protein interactions with statistics derived from the corresponding probability density distribution functions generated from randomized data. In the generation of the random interaction sets, we kept each protein's reported number of interactions fixed and, thus, we expect the corresponding interaction probabilities of proteins with a high (low) number of reported interactions also to be high (low).
Note that the reported number of interactions involving a protein refers to the total number of observed PPIs in the literature involving that specific protein. This is different from a protein's degree, which is defined as the number of unique interacting protein partners. Thus, while TP53 is associated with 1,097 observed PPIs, its degree is reduced to 478 due to the multiple observations of many of the involved interactions.
Our calculations followed our previously described procedures for generating Z-scores from sets of interacting protein pairs [22]. Briefly, the aggregated PPI data tabulates all pairs and the number of their occurrence in the literature. From this list we counted, for each unique pair between proteins i and j, how many times it occurred O ij . Randomized versions of the original PPI list were then created under the conditions that: 1) the protein identifiers and the number of times they occur are preserved and 2) no interactions are allowed between proteins of the same identifier, i.e., self-interactions are not allowed. We generated M = 10 6 randomized PPI lists and calculated the average number of times each PPI from the original list occurred < R ij > and its standard deviation σ R . The corresponding Z-score was then calculated as, We further used the randomized data to estimate the p-value p ij of observing an interaction between proteins i and j in the original data set. Briefly, for each unique pair in the original PPI list, we calculated the number of times it occurred in each of the n = 1, . . . ,M random realizations and created the normalized probability density function PDF ij , i.e., the probability of finding O R ij ¼ 0; 1; 2;::: occurrences by random chance. We estimated the p-value for each actual interaction as, Z-scores and p-values (Z ij and p ij ) are legitimate metrics for ranking an observed PPI, although they are associated with different numerical properties and uncertainties because of the limitations imposed on using a finite number of randomizations.

Assessing p-values and Z-scores as PPI metrics
To assess the p-values and Z-scores as metrics in identifying and ranking high-confidence protein interactions in biological data sets, we compared the scored interactions with an independent PPI data set that was not part of the aggregated human data set. We used the human PPIs derived from experimentally determined structural protein complexes deposited in the Protein Data Bank (PDB) as outlined in the Materials and methods (Additional file 2). To assess our scored data using this data set as a benchmark, we selected only those interacting protein pairs from the scored human PPI data where both proteins appeared in the PDB-derived PPI data set. We termed these pairs as "judgeable," i.e., the proteins appear in the benchmark data set, and we can judge whether the interaction occurred or not. Thus, for a given score, we determined how many of our judgeable PPIs were interacting and non-interacting within the benchmark set. Figure 1A-B shows the frequency of interacting and non-interacting pairs as a function of Zscores and p-values. Interactions associated with high Zscores and small p-values were enriched with benchmark interactions, indicating that both metrics were effective in extracting high-confidence PPIs from the aggregated literature data.
The procedure for shuffling the data allowed us to compare the frequency of each observed interaction O ij to that interaction's probability density distribution function PDF ij . This distribution is different for each interaction and depends on the number of distinct proteins and interactions present in the entire network. Lacking an analytical expression to generate the exact distribution functions, the procedure outlined in Equations 1 and 2 allowed us to generate estimates for both Z-scores and p-values for each interacting pair. We found that using either only p-values or only Z-scores to be inferior to using the combined information (data not shown), and, hence, we aggregated these two metrics by converting the value of each metric into a rank and generating the average rank from both metrics for each interaction. Interactions with the same p-value (or Z-score) were assigned the same rank. If interactions had the same rank in the final averaged-rank list, these interactions were considered equivalent and analyzed together. The ranked data are provided in the Supplementary materials (Additional file 3). Instead of assigning post-priori probabilities to already observed interactions, we only used ranks and comparisons between ranked data to gauge the biological information contained in these subsets.
Top-ranked IDBOS PPIs are different from the most frequently reported PPIs Table 1 shows the 20 top-ranked PPIs that were identified using the IDBOS system (sorted by average rank), and Table 2 shows the most frequently reported PPIs (sorted by PPI occurrence). As expected, the most frequently reported PPIs involved ubiquitously studied proteins, such as those mentioned above (UBC and TP53) as well as growth factor receptor bound 2 (GRB2), which had 916 reported interactions. These interactions, however, were not observed in the top PPIs scored by the IDBOS procedure. Instead, the top PPIs of the IDBOS set were enriched with infrequently observed interactions. Although we could not independently assess whether these interactions are more "real" than the frequently observed ones, they are partly supported by independent literature citations. In fact, the proteins in these interactions could be classified into two groups. The first group comprised proteins that had the same specific function or were subunits of the same protein complex, such as branched chain keto acid E1 alpha (BCKDHA) and branched chain keto acid E1 beta (BCKDHB) [3,29], and dynein cytoplasmic 2 intermediate chain 1 (D2LIC) and dynein cytoplasmic 2 heavy chain 1 (DNCH2) [30,31]. The second group comprised proteins that had a ligand/receptor relationship, such as inducible T cell co-stimulator (ICOS) and inducible T cell co-stimulator ligand (ICOSLG) [32][33][34], or gastric inhibitory polypeptide (GIP) and gastric inhibitory polypeptide receptor (GIPR) [35][36][37][38]. Among the top 20 PPIs of the IDBOS set, two seemingly unrelated protein pairs were actually closely related. The proteins in the LTC4S/MGST1 interacting pair are actually two of the six members of the membrane-associated proteins in the eicosanoid and glutathione metabolism (MAPEG) family.
Similarly, the top-ranked interaction between L-threonine dehydrogenase (TDH) and aminoacetone synthetase (alias of GCAT) catalyzes the conversion of L-threonine to glycine [39].
Not only were the top-ranked sets different, but the remaining bulk of the interactions also showed considerable changes in rank when the IDBOS p-values were non-zero (See Additional files 3 and 4). To evaluate the effect of this re-ranking of interactions, we then asked whether these rankings have any impact in retrieving biological information. We addressed this question by comparing the ability of the two schemes to identify known and inferred biological relationships based on rank. The hypothesis we tested was that higher-ranked subsets of the data sets are better at retrieving biological information, and that IDBOS ranking provides a more efficient way of retrieving this information than ranking solely based on frequency of occurrence.

Evaluation of ranking schemes as measures of identifying interacting proteins
To assess different PPI ranking schemes, we constructed six benchmark reference sets derived from high-quality experimental studies as detailed in Materials and methods. These independent data sets comprise PPIs detected using 1) far-Western blotting, 2) isothermal titration calorimetry, 3) nuclear magnetic resonance, 4) surface plasmon resonance, 5) direct interactions from protein complex structures from the PDB, and 6) homologous human PPIs derived from actual mouse PPI data. These Frequency -log(p) Figure 1 IDBOS scoring schemes. The method presented is an extension to the interaction detection based on shuffling (IDBOS) method used for mass spectrometry co-purification data [22]. We compared the set of known protein-protein interactions (PPIs) with randomized versions, which preserve the number of interactions per protein, to obtain a Z-score and a p-value for each interaction. These quantities are schematically outlined at the top, where a randomized probability density distribution function (PDF) is used to illustrate the p-value and Z-score calculations for a particular interaction between proteins i and j. To evaluate these scoring schemes, we analyzed interactions derived from crystallographic complexes in the PDB. Each human PPI was compared to a PPI derived from protein structure data in the PDB and assigned to one of two subsets: interactions or non-interactions. If the PPI was present in the PDB interaction data set, the pair was assigned to the interactions set, otherwise the pair was assigned to the non-interactions set. It is reasonable to assume that the first subset should be enriched with actual PPIs. (A) Distribution of Z-scores corresponding to "interactions" and "non-interactions" assigned to PDB-derived PPIs, and (B) the corresponding p-value distributions. We found that both p-values and Z-scores could distinguish these subsets, suggesting that they are useful metrics.
benchmark sets tested the ability of the ranked data to retrieve known interactions derived from a variety of experimentally determined PPI data sets. Using these six benchmark reference sets, we compared the IDBOS-ranked data set with a set ranked by PPI frequency of occurrence, a set ranked using the hypergeometric test [23] (see Materials and methods, Additional file 5), and two random data sets. The random data sets included a set in which the PPIs themselves were retained, but their ranks were assigned randomly, and a set where both the interactions themselves were randomized and were assigned a random rank. In addition, we included the results obtained by directly using the nine individual data sources that formed the aggregated collection of human PPIs. The objective here was to assess each ranking method's capability to retrieve true interactions in each of the reference sets, as a function of rank, i.e., higher-ranked data should contain a larger fraction of true interactions than lower-ranked data. The top-20 ranked protein-protein interactions (PPIs) as identified by the interaction detection based on shuffling (IDBOS) method. We also provided the number of times each individual protein was observed in the aggregated data set, the number of times the PPI was observed, and the corresponding Z-score and associated fractional rank. All PPIs in this table have p-values <10 −6 , i.e., in the 10 6 randomization shuffles each of these PPIs were never observed to occur more times than they were actually observed in the original data set. In the data set there were 5,288 PPIs with zero p-values and, hence, we assigned their uniform pvalue rank to be 2,644.5 (5288/2). This value was used together with each Z-rank to create the average rank in the last column. The estimated Z-score of a PPI depends on the number of times each protein was observed in the original aggregated data set.
To make a fair comparison within each benchmark reference set, for each set we defined a protein pair from the aggregated data set as a judgeable interaction if both proteins appeared in the reference set; otherwise, we assumed that the pair could not be judged by the reference set. If a judgeable interaction was observed in the reference set, we termed it a true interaction. A good scheme would rank true interactions before other judgeable interactions. To quantify the ability to retrieve true interactions in each scored set, we extracted the judgeable subset and, using each corresponding rank as a threshold, we counted the numbers of judgeable and true interactions with scores above the threshold. We defined the number of true interactions as coverage and the fraction of true interactions among those judgeable interactions as accuracy. Top 20-ranked protein-protein interactions (PPIs), as identified by the number of times they were observed in the aggregated data set. The quantities and calculations are annotated as in Table 1, except that the data were sorted in decreasing order of PPI occurrence O ij . The proteins involved in these interactions are easily recognizable, and this partly reflects an underlying study-bias of focusing research efforts on a smaller set of high-interest proteins. Figure 2 shows the results of this analysis for the six interaction benchmark reference sets. As expected, the completely random PPI data set had no capability to retrieve direct protein interaction data. For the network created by assigning random ranks to existing PPIs, the accuracy was almost uniform at any given coverage.
IDBOS, hypergeometric, and frequency of occurrence ranking were each associated with enhanced accuracy at higher ranks, thus indicating that true interactions ranked before other judgeable interactions. We assessed the overall performance of the ranking methods by calculating relative improvement compared to using no ranking, which is equivalent to assigning random ranks to the existing data. For the accuracy as a function of coverage plots shown in Figure 2, we calculated the average accuracy < A >over all ranks r as, where n(r) is the number of true PPIs at rank r and N is the sum of all n(r)'s. The gain of using IDBOS, hypergeometric, or occurrence as a ranking method was estimated by comparing the average accuracy < A > to the randomly ranked data < A R >as, Table 3 lists the gains in average accuracies in each reference set for the different ranking schemes. The IDBOS ranking scheme shows the greatest increase compared to using randomly assigned ranks, achieves a twofold increase in accuracy compared with using frequency of occurrence ranking, and consistently outperforms the hypergeometric ranking. In practice, one should also note that selecting any high-accuracy subset of PPIs based on ranked interaction data creates a smaller data set and lowers the overall PPI coverage.
The independent scores (Z-scores or p-values) were typically better than, or equivalent to, the hypergeometric method: Z-score ranking was close to the IDBOS, and better than the hypergeometric, whereas the p-value ranking was not as good as the hypergeometric. For example, using the PDB benchmarking set, the gains of the IDBOS, Z-score, hypergeometric, and p-value methods Isothermal titration calorimetry Figure 2 Retrieving protein-protein interactions. The proposed interaction detection based on shuffling (IDBOS) ranking scheme was compared with hypergeometric test ranking, frequency of occurrence ranking, one randomly ranked set (where the observed protein-protein interactions [PPIs] take random ranks), a randomly rewired interaction set, and nine individual data sets. We compared these ranking schemes using benchmark reference sets derived from six independent experimental data sets, namely: (A) far-Western blotting, (B) isothermal titration calorimetry, (C) nuclear magnetic resonance, (D) surface plasmon resonance, (E) protein pairs in direct contact with each other within protein complex structures collected from PDB (contacts determined using the jointly buried surface area), and (F) mouse homologous protein interactions. We assumed that each reference set could only judge a PPI when each of its two proteins actually appeared in this reference set (judgeable). We assessed performance by coverage (number of true interactions) and accuracy (fraction of true interactions among judgeable interactions) as a function of varying ranking thresholds (when applicable). The higher-ranked thresholds are to the left in the plots. In the set scored by occurrence alone, multiple PPIs possessed the same score, and the second right-most symbol entry corresponds to the collection of PPIs with an observed occurrence ≥2.

IDBOS versus hypergeometric test rankings
The observed differences in Table 3 and Figure 2 between the IDBOS and hypergeometric test rankings warrants further comment. The fundamental difference between the IDBOS and hypergeometric test rankings lies in how they account for interactions between self-interacting proteins. For the IDBOS randomized data, each pair of proteins was re-assigned an interaction partner such that no self-interacting protein pairs remained in the final set used to construct the probability density distribution functions. In the hypergeometric test, self-interacting protein pairs were assigned finite probabilities of occurrence based on a background distribution for each protein pair, which was different from IDBOS. Conceptually, the sum of the probability of observing all interactions with a given protein A among all other proteins p AA , p AB , p AC , etc., was the same in both schemes. However, the constraint that p AA was zero in IDBOS and p AA was non-zero in the hypergeometric test, re-distributed the probabilities such that any interaction probability p AB between protein A and another protein B could be different in the two schemes. This strongly influenced the probability of detecting proteins that occurred with high frequency in the data set. The effect of including or excluding self-interacting protein pairs was magnified in the evaluation of interactions involving popular proteins. In the hypergeometric test, the likelihood of randomly generating self-interacting protein pairs is roughly proportional to the square of the number of times the protein appears in the data set. In practice, this leads to an underestimate of the occurrence of non-identity pairs in the random data, and this increases the significance attached to the observed non-identity PPIs involving popular proteins. The effect on ranking PPIs was considerable, e.g., among the top 20 occurrence-ranked interactions listed in Table 2, 11 appeared in the top 20 hypergeometric test ranking scheme. In fact, the first occurrence-ranked interaction between MDM2 and TP53 was still the second ranked interaction based on the hypergeometric test. Figure 3 shows the overlap of interactions among top-ranked PPIs based on IDBOS, the hypergeometric test, and frequency of occurrence rankings. As exemplified above, there was a noteworthy overlap (>0.40) between the ranking results in the frequency of occurrence and the hypergeometric test rankings for all ranks tested. In contrast, the first 10 3 top-ranked IDBOS PPIs showed low overlap (~0.10) with the frequency of occurrence ranked PPIs, but higher similarity (~0.36) to the ranked PPIs identified using the hypergeometric test. Thus, the high overlap between occurrence and hypergeometric rankings suggested that the inadvertent biases introduced by frequent investigations of popular proteins could not be completely disentangled by the hypergeometric-ranking scheme. Furthermore, we verified that consensus ranking schemes that included the average rank of all three ranking methods, or combining IDBOS and hypergeometric test rankings only, did not increase the accuracy beyond using IDBOS rankings (data not shown). Instead, we contend that the corresponding probability density distribution functions generated by IDBOS are the most appropriate to gauge unbiased PPIs in this data set.

Enrichment of known domain-domain interactions
The hypothesis that PPIs are mediated by a smaller set of specific domain-domain interactions (DDIs) that are repeatedly used, can be exploited by inferring DDIs from known PPIs and then predicting novel PPIs from the inferred DDI set [40,41]. A high-confidence PPI set, such as the collection of highly ranked PPIs, is more likely to contain reliable DDIs. To confirm this, we investigated The average accuracy < A > and gain associated with interaction detection based on shuffling (IDBOS), hypergeometric test ranking, and occurrence-ranking schemes were estimated using Equations 3 and 4. The value associated with the randomly ranked data set does not represent a random protein-protein interaction (PPI) data set as only the rank, and not the interactions themselves, were randomized. The tested reference sets contain PPIs derived from 1) far-Western blotting, 2) isothermal titration calorimetry, 3) nuclear magnetic resonance, 4) surface plasmon resonance experiments, 5) known protein crystal complexes contained in the Protein Data Bank (PDB), and 6) those derived from experimentally determined mouse PPIs. We tested whether the numerical differences in values of < A > for each pairwise comparison within each reference set were due to chance using the t-test, and rejected this hypothesis with a p-value of <10 -6 .
the fraction of known DDIs among all candidate domain pairs as a function of the rank threshold. As before, we only considered judgeable protein pairs whose domains appear in the set of known DDIs. The known DDI set was inferred from the PDB crystal structures [42]. As two random controls, we also randomly ranked and randomly rewired the judgeable protein pairs. Figure 4 shows that more stringent ranking thresholds (represented by fewer "Top judgeable PPIs" in Figure 4) resulted in higher enrichments of DDIs among all domain pairs between interacting proteins for both the IDBOS and frequency of occurrence rankings. This behavior was in contrast with the two random controls, which did not show such enrichments. Table 4 quantifies the improvement gained by ranking the data relative to no rankings. Within the same top number of PPIs, the IDBOS set had a higher enrichment of known DDIs than sets ranked either by the hypergeometric test or by frequency of occurrence, indicating a potential application of the IDBOS-ranked PPI sets for the inference of accurate DDIs.

Evaluation of ranking PPIs as a means of retrieving biological information
Up until now, the different scoring schemes were used to evaluate reference data sets that can be considered to be directly related to interacting proteins. We next evaluated the improvement that could be gained by using the differently ranked data sets to rank-order the interactions in reference data sets that are presumed to be enriched with interacting proteins, i.e., in sets of proteins that share the same Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway, are implicated in the same disease, share Gene Ontology (GO) function, or tissue mRNA expression levels.

Case I: Enrichment of KEGG co-pathway gene pairs
Genes encoding interacting proteins are more likely to be part of the same pathway. Hence, the fraction of PPIs that are annotated to belong to the same pathway should be larger in sets containing higher-ranked PPIs. We used KEGG pathway data [43] to define co-pathway gene pairs, as outlined in the Materials and methods. For each investigated PPI set, we excluded protein pairs involving proteins that did not participate in any KEGG pathway (these were deemed to be non-judgeable). Figure 5A shows the results of this analysis. We confirmed the correlation between copathway membership of a protein pair and its rank threshold in both the IDBOS and the frequency of occurrence ranking schemes, as well as the absence of correlation in the two random controls. Overall, IDBOS ranking outperformed frequency of occurrence ranking by a factor of four for enrichment of co-pathway gene pairs among selected top-ranking PPIs (Table 5), and the gain seen in hypergeometric ranking was substantially smaller than that of IDBOS ranking. There is a discontinuity in the curve for the frequency of occurrence ranking scheme, which was caused by hundreds of interactions between three subunits of G protein (GNAL, GNB1, and GNGT1) and olfactory proteins which were supported by seven publications (Additional file 1). We reviewed these publications and found that this literature support was weak, with some based solely on gene expression evidence of those olfactory genes [44]. Among these hundreds of PPIs, those involved with GNB1 (beta subunit) or GNGT1 (gamma subunit) were not supported by KEGG pathway data, resulting in a sharp drop in accuracy.

Case II: Enrichment of co-disease gene pairs
Genomic variations may underlie different susceptibilities to disease, e.g., individuals with specific single nucleotide polymorphisms (SNPs) can develop, or be predisposed to, a particular disease phenotype. Furthermore, genes encoding interacting proteins are more likely to occur within the same disease classification [45]. Using the gene co-disease data extracted by Goh et al. [45] from the Online Mendelian Inheritance in Man (OMIM) Recovery of biological co-annotations from protein interactions. The proposed interaction detection based on shuffling (IDBOS) ranking scheme was compared with the frequency of occurrence ranking, one randomly ranked set (where the observed protein-protein interactions (PPIs) take random ranks), and a randomly rewired interaction set in retrieving biological relationships from diverse reference sets. The reference sets were: (A) co-pathway gene pairs, (B) co-disease susceptibility gene pairs, (C) functionally related gene pairs, and (D) tissue co-expressed gene pairs. We assessed performance by calculating the enrichment fraction of true PPIs among the judgeable interactions as a function of varying ranking thresholds. In the set scored by occurrence alone, multiple PPIs possessed the same score, and the second right-most symbol corresponds to the collection of PPIs with an observed occurrence ≥2. The fraction of true domain-domain interactions (DDIs) retrieved with interaction detection based on shuffling (IDBOS), hypergeometric test ranking, occurrence ranking, and randomly ranked data. We calculated the reported fraction and gain associated with the different ranking schemes as outlined in Equations 3 and 4. We tested whether the numerical differences in the fraction values for each pairwise comparison were due to chance using the t-test, and rejected this hypothesis with a p-value of <10 -24 .
dataset [46], we investigated the enrichment of co-disease gene pairs as a function of rank threshold ( Figure 5B, Table 5). A gene pair was termed a co-disease gene pair if their SNPs led to susceptibility to the same disease. With this co-disease gene pair set, we also extracted judgeable protein pairs whose encoding genes appeared in this codisease set (i.e., genes that have an actual disease annotation). Similar to the previous analyses, the IDBOS-scored set showed an enhanced probability of identifying existing disease-susceptible genes as compared with the frequency of occurrence ranking scheme, hypergeometric-based rankings, or random controls.
Case III: Enrichment of GO co-function gene pairs Figure 5C shows the ability of the IDBOS and frequency of occurrence ranking schemes to identify PPIs whose proteins share the same gene ontology (GO) function. Both ranking schemes displayed a similar rankdependent enrichment of co-function proteins compared to the randomly ranked (all ranks have the same importance) and randomly rewired (interactions are random) data set. Table 5 indicates that the improvements of the ranking schemes compared to randomly ranked data are relatively modest in retrieving functionally related gene pairs. As pointed out by Gillis and Pavlidis [28], the ability of PPI networks to distinguish co-functionality is largely influenced by the multifunctional nature of the gene annotations schemes themselves. In this case, the IDBOS, hypergeometric, and occurrence ranking schemes themselves will also have less influence on the retrieval of functionally related protein pairs compared to the annotation scheme itself.

Case IV: Enrichment of tissue co-expression gene pairs
Finally, we examined the occurrence of interacting protein pairs in a large data set that maps out the global mRNA expression levels for~20,000 human genes in 32 normal tissues [47]. For detail on the construction of this reference set, see Materials and methods. The biological hypothesis tested here examined co-expression of two mRNAs in the same tissue as indicative that the corresponding proteins had higher probability of interaction. Figure 5D shows the fraction of PPIs retrieved as a function of ranking scheme. Overall, the fraction of PPIs contained in the co-expression data set was lower than that of the other biological reference sets examined above. However, the investigated ranking schemes were able to produce ranked data sets that contain a slightly larger fraction of co-expressed PPIs than using no ranking (Table 5).

Topological differences of ranked PPI networks
Data sets of protein interactions are commonly represented as graphs or networks of protein interactions, and applying network topology metrics to characterize the biological role of PPIs has attracted wide attention [9,10,22,48]. The presence of confounding factors, e.g., protein abundance [49] or popularity [12], can strongly influence the topological properties of the network. Here, we used subsets of highly ranked PPIs derived from using either IDBOS or occurrence ranking to select smaller highconfidence networks. Figure 6A shows the overall networks constructed from the top-ranking PPIs corresponding to 4,425, 6,561, and 13,554 interactions selected based on the number of PPIs that have more than or equal to four, three, and two reported occurrences, respectively. Figure 6B shows the corresponding degree distribution, i.e., the distribution of the number of unique interacting partners each protein has (Degree), as well as the overall degree distribution for the entire network (All). Selecting highly ranked subsets of PPIs, using either IDBOS or occurrence, had the effect of reducing the number of high-degree proteins (hubs) in the overall network. This effect was most pronounced in the topranked IDBOS networks, which effectively down-weights interactions of popular proteins. The resulting topranked IDBOS networks were not dominated by hub interactions, but show a more distributed interaction network, with distinct topological and scaling properties.

Discussion
Single experiments designed for large-scale detection of PPIs judge multiple observations of the same PPI as The fraction of biological associations retrieved using interaction detection based on shuffling (IDBOS), hypergeometric test ranking, occurrence ranking schemes, and randomly ranked data. We calculated the reported fraction and gain associated with the different ranking schemes as outlined in Equations 3 and 4. The four reference sets tested contain putative biological relationships pertaining to protein interactions that are in the same Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway (co-pathway), implicated in the same disease (co-disease), share Gene Ontology (GO) function (co-function), or tissue expression levels (co-expression). We tested whether the numerical differences in the fraction values for each pairwise comparison within each references set were due to chance using the t-test, and rejected this hypothesis with a p-value of <10 -2 .
evidence for the confidence of the interaction. Multiple observations, across many different experiments, of a particular PPI also lend confidence that this interaction occurs in nature. One would then conclude that if one aggregated all known PPI experiments, the more times a particular PPI occurred, the more confident one would be of the interaction. We found a somewhat counterintuitive result in that ranking PPIs in order of the number of times a PPI has been observed was actually not the optimal way of assessing the importance of frequently reported interactions. Instead, we used the IDBOS method, which ranks interacting protein pairs in the observed PPI data sets as compared to random occurrences derived from an empirically reconstructed probability density distribution function specific to each interaction. We then used these ranks to order the PPI data and showed that this approach consistently identified more known protein associations in numerous benchmark reference sets than using either hypergeometric test rankings or simply frequency of occurrence. The improvement of IDBOS over the hypergeometric test results stems from the differences in the underlying random distributions in the two schemes. We reconstructed the corresponding top-ranked protein-protein interactions (PPIs) networks using both interaction detection based on shuffling (IDBOS) ranking and frequency of occurrence ranking. The columns of 4,425, 6,561, and 13,554 top-ranked interactions corresponds to selecting PPIs with ≥ 4, ≥ 3, and ≥ 2 reported occurrences, respectively. These interactions were distributed among roughly twice as many proteins using IDBOS ranking than in occurrence ranking. (B) The degree distribution for each selected PPI network was analyzed and compared to the distribution of all aggregated interactions (All). The higher-ranked PPI data sets were associated with fewer proteins that had a large number of interacting partners (hubs). As visualized in (A), the top-ranked networks were instead characterized by more evenly distributed interactions in a loosely interconnected network.
While the hypergeometric test does not impose any limitations on self-interacting proteins, by construction, the IDBOS procedure sets their probability of occurrence to zero even for the random case. The input data contain no self-interacting proteins; therefore, the corresponding random probability density distribution functions generated by IDBOS seem to be the most appropriate to gauge unbiased PPIs in this data set.
We evaluated the improvement in using ranked PPI data to identify known protein relationships in eleven independent reference sets, including direct and indirect readouts of protein interactions. We used a large number of different reference sets to perform as comprehensive an analysis as possible, as each reference set only covers a specific subset of protein interactions. It was clear that ranking interactions was of value, because higher-ranked PPIs had better success at retrieving protein associations. Across all sets investigated, occurrence ranking improved the performance by 46% compared to using non-ranked data, hypergeometric ranking achieved a 109% increase, and the IDBOS ranking scheme achieved a 134% increase. This conformed to the assumption that aggregating and creating high-confidence subsets of data creates added value, albeit at a cost of reducing the overall number of PPIs that can be considered.

Conclusions
We have developed a statistical approach to infer subsets of high-confidence human PPIs, and showed that ranked data can consistently enrich the accuracy of the retrieved PPI data in these sets. Our IDBOS method was more successful in ranking interactions than using either the number of times an interaction has been observed across experiments or rankings based on the hypergeometric test. Furthermore, using either IDBOS or hypergeometric scoring schemes generates unique ranks for almost all interactions, as opposed to the frequency of occurrence method, in which many interactions have the same integer score corresponding to the number of observed occurrences. The IDBOS ranking increased accuracy and enrichment of protein interaction data associated with PPIs by more than two-fold compared to simply ranking interactions based on observed occurrences. We achieved this improvement by comparing the observed interaction data with a probability density distribution function that does not inflate the statistical importance of interactions associated with frequently studied proteins. These results suggest that using our high-confidence protein interactions at different levels of confidence could help clarify the dependence on confidence on topological and biological properties associated with human protein networks.

Statistical scoring of PPIs from aggregated experimental data
We downloaded the collection of PPIs in October 2011 from the nine databases that covers the bulk of all known experimentally determined PPIs. Databases of non-primary nature, i.e., containing aggregated data and/ or predicted and inferred interactions, were excluded. From this collection, we extracted the subset of human physical PPIs, consisting of PPIs from both large-, semilarge-, and small-scale experimental studies. Because Reactome has no standardized annotations describing physical associations or direct interactions, instead we extracted PPIs annotated as "direct complexes." We treated each interaction reported by each study (identified by a unique publication ID) as a unique record by deleting redundant copies arising from the overlaps among the nine databases. In total, we analyzed 13,369 proteins, 80,980 PPIs, and 116,134 records (Additional file 1). The computational procedure to generate 10 6 random realizations of the PPI data set and compute the corresponding p-values and Zscores took~2,000 minutes on a dual core Xeon Irwindale 3.6 GHz 64-bit Linux server equipped with 4 GB of RAM. We used fractional rankings, i.e., PPIs that had the same score received the same ranking number, which is the mean of what they would have received under ordinal rankings. The Supplementary material provides the scored and ranked PPI data set, with equivalently ranked interactions tabulated in arbitrary order.
To rank interactions based on an alternative statistical method compared to IDBOS and frequency of occurrence, we included ranks based on a hypergeometric test [23]. For two proteins, i and j, given that the interaction between i and j occurred N ij times, the probability (p ij ) for these two proteins to have this or a larger number of interactions by chance was approximated using a hypergeometric distribution as follows: where N i (N j ) is the number of times protein i (j) was observed in the aggregated data set, and N is the total number of interactions in the aggregated data set of PPIs. This formulation is correct in the limit of N > > N i , N j , which was satisfied in this data set. Finally, we ranked the PPI data based on the calculated probabilities.

PDB interactions
We mapped all PDB sequences (downloaded in October 2010) to the human protein sequences stored in the Uni-ProtKB/Swiss-Prot database (downloaded in June 2008) and obtained 649 protein complexes that contained at least two different human proteins. We then calculated the contact area of each intra-complex human protein pair using the program EMPIRE [50], to determine which protein pairs interact. For protein pairs occurring in multiple complexes, we selected the pair with the largest contact area as an interacting pair. We collected 281 direct protein interactions between 563 proteins, with contact areas ranging from 0.1 to 183 nm 2 , as a reference set for known human PPIs derived from structural data (Additional file 2).

Reference sets from the HIPPIE database
We extracted four reference sets from the Human Integrated Protein-Protein Interaction rEferenc (HIPPIE) data set (http://cbdm.mdc-berlin.de/tools/hippie/ information. php, January 2012), each of which consists of PPIs detected by a top-scored experiment technique (score = 10, specified with European Bioinformatics Institute (EBI) term ID, and number of PPIs >100). The extracted data set contained PPIs derived from 1) far-Western blotting, 2) isothermal titration calorimetry, 3) nuclear magnetic resonance, and 4) surface plasmon resonance experiments.

Homologous human PPIs from mouse data
We extracted the subset of mouse physical PPIs from the above collection and constructed a homologous human PPI set according to the sequence homology between mouse and human proteins defined by the National Center for Biotechnology Information (NCBI) homology mapping scheme (ftp://ftp.ncbi.nih.gov/pub/HomoloGene/, October 2009). We used this homologous human PPI set (containing 3,148 interactions between 5,632 proteins) as a reference data set reflective of direct protein interactions.

Domain annotation and interaction data
We used the Pfam-A families of the Pfam21.0 database as the source for domain annotation [51]. We also downloaded the DDI set inferred from PDB crystal structures from the iPfam database [42]. This DDI set contains 4,030 interactions between 2,837 Pfam-A domains.

Co-disease susceptibility gene pairs
We used the disease susceptibility gene data of Goh et al. [45], which was constructed by processing OMIM raw data [46]. There were 43,249 pairs between 3,670 genes whose two genes were susceptible to at least one common disease. We used these pairs to represent the co-disease susceptibility gene pairs.

Co-function gene pairs
We downloaded the GO annotation data (www.geneontology.org, June 2010) to identify a set of functionally related gene pairs as a reference set. For each gene, we expanded its GO annotation list by including all ancestors of each member in the list. For two genes, g and h, given that they had n gh GO annotations in common, the probability (p gh ) for these two genes to have this or larger annotation overlap by chance was estimated from the hypergeometric distribution as follows: where N g (N h ) is the number of GO annotations of gene g (h) and T is the total number of unique GO annotations. We used -log(p gh ) as the score and chose the top 1% of all gene pairs as a reference set, resulting in 1,502,420 co-function pairs for this reference set.

Co-pathway gene pairs
The KEGG pathway data file (hsa_gene_map.tab) was downloaded from its Web site (ftp://ftp.genome.jp/pub/ kegg/pathway/, June 2010), which lists the pathways that each annotated gene participates in. Similar to the cofunction score described above, the co-pathway score was also computed using the hypergeometric distribution method. We chose the top 1% as the reference set of significant co-pathway gene pairs (139,841 pairs).

Tissue co-expression of gene pairs
We used a global expression data set containing mRNA expression levels for~20,000 human genes in 32 normal tissues derived from massively parallel signature sequencing [47]. The raw data set documents the abundance values of each short sequence signature tag (with a total of 182,727 tags) in 32 tissues. We mapped these tags to the regions of the human genome (hg18) that encode genes, in both orientations. We obtained 105,512 hits in the gene regions in both orientations and, among them, 68,855 hits in the gene orientation (p-value < 10 −2,000 ), indicating that the tags were able to distinguish the transcribed orientation from the non-transcribed orientation in the genome. Assigning the tags that hit a gene region and orientation to the corresponding gene, we obtained a set of tags for each gene, resulting in a total of 14,516 genes having non-empty tag sets. We summed up the abundance tissue profiles of the tags of a gene to create its raw expression profile. Furthermore, we computed the statistical significance of a gene being preferentially expressed in a tissue (termed "tissue specificity") using an approach similar to that of Yu et al. [52], which identified tissue-specific genes from the NCBI Expressed Sequence Tag database. Let e k (g) be the expression level of gene g in tissue k. The total expression of gene g in all 32 tissues is, and the expected total expression of all genes in tissue k is, If we randomly throw E(g) darts into 32 areas of sizes E k , k = 1, 2, . . .32, and each dart has a probability,