Improved homology-driven computational validation of protein-protein interactions motivated by the evolutionary gene duplication and divergence hypothesis

Frech, Christian; Kommenda, Michael; Dorfer, Viktoria; Kern, Thomas; Hintner, Helmut; Bauer, Johann W; Önder, Kamil

doi:10.1186/1471-2105-10-21

Methodology article
Open access
Published: 19 January 2009

Improved homology-driven computational validation of protein-protein interactions motivated by the evolutionary gene duplication and divergence hypothesis

Christian Frech¹,
Michael Kommenda¹,
Viktoria Dorfer¹,
Thomas Kern¹,
Helmut Hintner²,
Johann W Bauer² &
…
Kamil Önder^2,3

BMC Bioinformatics volume 10, Article number: 21 (2009) Cite this article

4555 Accesses
6 Citations
Metrics details

Abstract

Background

Protein-protein interaction (PPI) data sets generated by high-throughput experiments are contaminated by large numbers of erroneous PPIs. Therefore, computational methods for PPI validation are necessary to improve the quality of such data sets. Against the background of the theory that most extant PPIs arose as a consequence of gene duplication, the sensitive search for homologous PPIs, i.e. for PPIs descending from a common ancestral PPI, should be a successful strategy for PPI validation.

Results

To validate an experimentally observed PPI, we combine FASTA and PSI-BLAST to perform a sensitive sequence-based search for pairs of interacting homologous proteins within a large, integrated PPI database. A novel scoring scheme that incorporates both quality and quantity of all observed matches allows us (1) to consider also tentative paralogs and orthologs in this analysis and (2) to combine search results from more than one homology detection method. ROC curves illustrate the high efficacy of this approach and its improvement over other homology-based validation methods.

Conclusion

New PPIs are primarily derived from preexisting PPIs and not invented de novo. Thus, the hallmark of true PPIs is the existence of homologous PPIs. The sensitive search for homologous PPIs within a large body of known PPIs is an efficient strategy to separate biologically relevant PPIs from the many spurious PPIs reported by high-throughput experiments.

Background

Physical interactions between proteins, commonly referred to as protein-protein interactions (PPIs), occur at every level of cell function to elaborate the organism's phenotype. The study of PPIs is therefore of great interest and is helping to reveal basic molecular mechanisms of many diseases. High-throughput screening methods have given insight into hundreds of thousands of potential PPIs in several organisms. However, a major disadvantage of high-throughput approaches is their high rate of false-positive PPIs, i.e. erroneously reported PPIs that do not occur in vivo [1–7].

The development and implementation of computational methods for the validation of experimentally determined PPIs is therefore an important goal in bioinformatics today. Common approaches include determining intersections between different high-throughput PPI data sets [3], incorporating protein annotation data [5, 8], analyzing expression profiles [4, 9–12], investigating topological criteria of PPI networks [13–17], and inspecting patterns of co-evolution [18].

Another, well established in silico technique to validate an experimentally determined PPI is to check if homologs of the interacting proteins also interact; if so, the confidence of this PPI is increased. The original interolog concept suggests to examine PPIs among functionally conserved orthologs, i.e. functionally conserved proteins in other species that evolved from a common ancestor [19, 20]. However, large-scale application of this method for PPI validation is strongly hampered by limited coverage of most interactomes and by low numbers of known bona fide orthologs [21]. A first practical approach involved the inspection of PPIs among paralogous proteins, i.e. homologous proteins that evolved by gene duplication and are found within the same species [4]. Nevertheless, sensitivity remains a problem because in most organisms assured paralogs with known interactions are scarce. The strategy illustrated in Figure 1, which is followed in this paper, searches for homologous PPIs independent of species boundaries or functional constraints, which significantly increases the amount of PPI data usable for validation purposes (if not stated otherwise, the term 'homologous PPI' is understood as defined in Figure 1). Several papers applied this 'all-inclusive' approach to homology-based PPI validation [8, 22, 23]. Also techniques developed for PPI prediction, a relatively more well-studied bioinformatics problem, successfully utilized this idea, for example Brown et al. [24] or Jonsson et al. [25]. However, the focus of the present paper is not PPI prediction but the computational validation of experimentally determined PPIs.

Firstly, we draw the reader's attention to the duplication-divergence hypothesis of PPI evolution, i.e. the idea that extant PPIs primarily originate from gene duplications, the homologs diverging over time. If PPIs share common evolutionary ancestry, which is what this hypothesis suggests, then this ancestry reaches far into the evolutionary past. Consequently, homology-based PPI validation should investigate also diverged homologs and not only similar proteins.

Secondly, motivated by this idea, we propose an improved, sequence-based procedure for homology-based PPI validation. Unlike previously published, mostly binary validation schemes that deem a questioned PPI as biologically relevant as soon as a single homologous PPI is found, we follow a similar approach as Jonsson et al. [25] and compute a confidence score that takes into account both the quality and quantity of all identified homologous PPIs. The assignment of higher scores to high-quality hits and of lower scores to low-quality hits allows us to extend the search for homologous PPIs from reliable homologs to highly putative paralogs and orthologs with E-values up to 10. In addition to similar scoring schemes proposed before, we normalize and combine scores obtained from different homology search strategies.

Thirdly, we demonstrate the high efficacy of homology-based validation when carried out on large PPI data sets. A comprehensive data set of known physical binary PPIs from six PPI source databases is compiled, comprising 135,276 PPIs from 20 different organisms. This is, to the best of our knowledge, the largest collection of PPIs that has been used so far in this kind of analysis. Based on Receiver Operating Characteristic (ROC) curves it is shown that the new approach improves over previous methods for homology-based PPI validation.

Results and discussion

Duplication-Divergence Hypothesis of PPI Evolution

Gene duplication is a ubiquitous mechanism in molecular evolution and the principal source of biological innovation, producing new proteins and novel functional domains [26–30]. Here, we follow the idea that the duplication of genetic material coupled with subsequent divergence is also the dominant mechanism for the development of novel PPIs [31]. This hypothesis is supported by both theoretical models [32–34] and empirical evidence [35–40]. A brief review of papers supporting the duplication-divergence hypothesis can be found in Additional file 1.

Duplication-divergence models of PPI evolution propose a simple and yet plausible idea of how evolution might have formed PPI networks over millions of years – by repeated duplication of interacting genes followed by their divergence. Figure 2 illustrates this idea.

Implications of the Duplication-Divergence Model for Homology-Based PPI Validation

The duplication-divergence model of PPI evolution as shown in Figure 2 suggests that most biologically relevant PPIs descend from a common ancestral PPI, i.e. the PPIs are homologous to each other. This allows assessing the plausibility of an experimentally determined PPI as follows: for a true-positive PPI, one expects to see many homologous PPIs, whereas for a false-positive PPI this should be less likely. A sensitive search for homologous PPIs should thus, in principle, be able to filter out large numbers of false-positive PPIs from experimental PPI data sets while retaining the bulk of true-positive PPIs. However, both the incompleteness of today's PPI data sets and the fact that common ancestry is often elusive represent major practical obstacles along the way.

For assessing the validity of an experimentally predetermined PPI, also 'weak' homologous PPIs can be informative ('weak' in the sense of 'weak signal of homology'). Their incidence might support or contradict the idea that a questioned PPI evolved by duplication-divergence, which in turn can strengthen or weaken the position that a PPI is biologically relevant. However, the value of weak homologous PPIs for PPI prediction is limited: one cannot infer from a given pair of interacting proteins that very distant homologs interact as well. In the majority of cases this prediction would be simply wrong, because due to divergence most duplicated PPIs are eventually lost. PPIs inferred by homology are thus only trustworthy if protein similarity is high [41, 42] or if these PPIs are supported by complementary data [24].

Weak Homologous Interactions – Signal or Noise?

If the duplication-divergence model of PPI evolution is correct, the existence of weak homologous PPIs should be an observable characteristic of biologically relevant PPIs. We set out to test this hypothesis. For both Gold Standard Positive (GSP) and Gold Standard Negative (GSN) data sets, PSI-BLAST was used to search for homologous PPIs, and their distribution was determined within different E-value windows. Figure 3 shows the results. It reveals two important differences between GSP PPIs and GSN PPIs. Firstly, there is an increased probability for GSP PPIs to have at least one paralogous or orthologous PPI. Secondly, significantly more GSP PPIs than GSN PPIs have large numbers of paralogous and orthologous PPIs (>10). Most interestingly, both differences are observed up to high E-value windows.

Not surprisingly, the first characteristic, the existence of at least one homologous PPI, is a highly reliable signal for GSP PPIs when sequence similarity is high. For example, almost every fifth PPI taken out of the GSP data set (18%) has a homologous PPI with an E-value lower than 10^-100 (Figure 3A). By contrast, the existence of such high-quality homologs is extremely unlikely for a GSN PPI (0.25%). The signal remains intact with very low levels of sequence similarity: within the last E-value window (ranging from 3 to 10) the probability of observing a homologous PPI for a GSP PPI remains still twice as high (63%) as for a GSN PPI (30%).

The distribution of homologous PPIs reveals the second interesting characteristic of GSP PPIs: they tend to accumulate large numbers of homologous PPIs. According to Figure 3A, in all windows with E-values greater than 10^-20, about 25% of GSP PPIs have more than 10 homologous PPIs; for GSN PPIs, this percentage never exceeds 8%. Thus, for many PPIs the existence of a large number of homologous PPIs is more conclusive than the existence of at least one homologous PPI, especially when sequence similarity is low: whereas about twice as many GSP PPIs than GSN PPIs have at least one homologous PPI within the last E-value window, almost four times as many GSP PPIs (21.4%) than GSN PPIs (5.8%) have between 10 and 100 homologs, and more than five times as many GSP PPIs (6.8%) than GSN PPIs (1.3%) have more than 100 homologs.

Both characteristics are observed independently of the fact whether only paralogous (Figure 3B) or only orthologous PPIs (Figure 3C) are investigated. For lower numbers of homologous PPIs, stronger signals on the paralogous data set are obtained than on the orthologous data set, which is consistent with the finding that PPIs seem to be more conserved within species than across species [41]. Interestingly, very large numbers (>100) of homologous PPIs are observed within the orthologous data set, most likely due to an increased number of gene duplications in higher eukaryotes.

We conclude that weak homologous PPIs are indeed an observable and distinguishing characteristic of biologically relevant PPIs, especially if they are observed in increased numbers. Consequently, weak homologous PPIs should be considered by homology-based PPI validation schemes.

Overall Performance

We devised a scoring scheme that incorporates the findings from Figure 3 (see Methods). The Receiver Operating Characteristic (ROC) curves in Figure 4 illustrate the overall performance of this scoring scheme for the MIPS, the Small Scale and the Multiple Evidence gold standard data sets. The y-axis shows the True-Positive Rate (TPR or sensitivity), i.e. the percentage of GSP PPIs that were correctly confirmed as biologically relevant. The x-axis represents the False-Positive Rate (FPR or 1-specificity), i.e. the percentage of GSN PPIs that were erroneously confirmed as biologically relevant. By varying the threshold of the score above which a PPI is confirmed as biologically relevant, different FPRs and TPRs are observed (a short introduction to ROC curves can be found in Additional file 1). For example, on the MIPS and the Multiple Evidence data sets, a TPR of more than 70% at an FPR of 10% is observed. An increased threshold results in a TPR of 80% at an FPR of 20% for the same two data sets. These values compare well to TPRs and FPRs reported by other, non-homology-based PPI validation techniques [11–14, 18]. This underscores the high efficacy of homology-based PPI validation, especially when carried out on rich PPI data sets.

The Small Scale gold standard performs worst, which might reflect differences in the quality of the data sets. The Multiple Evidence gold standard can be considered of highest quality, because detection of a PPI with different experimental methods is a very reliable indicator of its existence [3]. Indeed, this data set achieves the best performance up to a TPR of 70%. The MIPS gold standard contains PPIs audited by human experts and is thus very trustworthy as well, although a slightly poorer performance is seen on this data set. PPIs of the Small Scale gold standard are not reviewed manually and thus its reliability might to a large degree reflect the quality of the automated text mining tools that are frequently used to extract them from the scientific literature. Because these tools are error-prone [43], the Small Scale gold standard might contain more spurious PPIs than the other two data sets.

Note that although the class distribution in the gold standard data sets is skewed, i.e. the Random GSN data set is about 50 times larger than each GSP data set, this does not affect the overall ROC curve [44]. In fact, we observed the same overall ROC curve on a balanced data set where the number of randomly chosen GSNs roughly equals the number of GSPs (data not shown). ROC curves as shown in Figure 4 are ideal to illustrate the overall performance of a classifier, but do not make suggestions about which specific score threshold should be applied to classify a PPI as true or false. This decision depends on the TPR and FPR one is willing to accept. Supplementary Figure 1 shows selected score thresholds and their associated TPRs and FPRs.

Comparison of Homology-Based Validation Schemes

Previous homology-based PPI validation methods involve simpler, binary selection processes in which a PPI is deemed to be biologically relevant as soon as a single homologous PPI is found [4, 8, 23]. Figure 5 shows a performance comparison with two of these methods. Note that this comparison does not include homology-based PPI prediction techniques, although these techniques are widely used. The reason is that these techniques have a different focus and generally incorporate also non-homology-based criteria, which makes a direct comparison difficult.

The FASTA-based binary validation scheme shows remarkably high specificity, even at high E-values. For example, inclusion of homologous PPIs with E-values between 10^-4 and 10 results in an increase of the TPR of almost 25% (from 48.5% to 72.1%), whereas the FPR grows only by 13.5% within the same interval, remaining below 16%. Considering the low sequence similarity and the probable existence of many spurious hits at E-values up to 10, this find is remarkable. By contrast, the PSI-BLAST-based binary validation scheme is less specific (even at low E-values), but much more sensitive: almost 87% of the GSP PPIs have at least one homologous PPI identified by PSI-BLAST (E-value ≤ 10). In addition to FASTA and PSI-BLAST, also BLAST was evaluated for homology detection (data not shown). In comparison to FASTA, no noticeable difference in performance was observed except for a slight decrease in maximum sensitivity (about 2% lower than with FASTA). Our scoring scheme, represented by the blue curve (squares), combines evidence from homologous PPIs found by FASTA and PSI-BLAST and clearly outperforms both individual binary validation schemes. For example, at a TPR of 70% it produces 4% fewer false-positives than the FASTA-based binary approach, and about 6% fewer false-positives than the PSI-BLAST-based binary validation scheme.

Previous homology-based validation schemes suggested different parameter settings. Saeed and Deane [23] used BLAST with an E-value up to 10^-4 to identify homologous PPIs and evaluated a TPR of 63% at an FPR of 7%. If this setting is transferred to FASTA and applied on our data sets, a TPR of 48.5% at an FPR of 2.3% is observed. Our scoring scheme, by contrast, produces only 1.5% false-positives at the same level of sensitivity (48.5%). The difference in FPR increases for higher levels of sensitivity, illustrating the additional value from incorporation of multiple methods for homology detection and from consideration of weak homologs. Patil and Nakamura [8] used PSI-BLAST with an E-value up to 10^-8 and reported a TPR of 89.7% at an FPR of 37.1% for their gold standards. On our data set the same parameter setting results in a significantly worse TPR of 70.1% at an FPR of 16.1%. Again, the scoring scheme outperforms and achieves a reduced FPR of 10% at the same level of sensitivity (70.1%). It is noteworthy that our exclusively sequence-based scoring scheme produces a superior ROC curve than Patil and Nakamura's Bayesian network approach, which incorporates three genomic features instead of one (sequence, structure and annotation information). This underscores again the potential efficacy of homology-based methods. No published PSI-BLAST parameters were found in the paper from Deane et al. [4], and thus the performance of this method was not assessed.

Contribution of Weak Homologs

Is it actually beneficial to include homologous PPIs with high E-values (>1) for PPI validation, i.e. do weak homologs indeed contribute positively in terms of increased sensitivity and/or increased specificity? To answer this question, the classification performance of the scoring scheme with and without the inclusion of weak homologs was determined. Figure 6 shows the results.

The inclusion of weak homologous PPIs contributes positively to the overall classification performance. For example, the restriction of the analysis to homologous PPIs with an E-value below 10^-10 results in a maximum TPR of 69% at an FPR of 14.5%. When homologs with E-values up to 1 are considered, the same sensitivity is achieved at a significantly reduced FPR of 10%. Another increase of the E-value threshold up to 10 leads to a further reduction of the FPR by 1%.

Note that a similar effect cannot be observed for the classic, binary validation schemes, where less stringent E-value thresholds increase sensitivity but decrease specificity (Figure 5). This emphasizes the value of the scoring approach: it finds evidence for biologically relevant PPIs among weaker homologs without the compromise of an increased rate of false-positives.

Although the additional benefit resulting from the inclusion of weak homologs with an E-value above 1 is rather low on this data set, we expect it to increase for data sets where high-quality homologs are not at hand. Yeast is comparably well investigated, with approximately 50% of its estimated 40,000 to 75,000 PPIs known [45]. As a consequence, most of its biologically relevant PPIs have homologous PPIs among high-quality paralogs, and matches among weak homologs add little extra value to the overall score. This situation is different from most other organisms where interactome coverage is far below 50% and where weak paralogs and orthologs are often the only possibility to validate a questioned PPI.

Conclusion

Knowledge of PPIs is key to understanding cell function. Although experimental high-throughput PPI detection techniques are now making it possible to catalogue all PPIs of a cell, the notoriously high error rates of these methods are a major obstacle to achieving this ambitious goal. Computational methods that can efficiently separate the PPI wheat from the chaff are therefore highly desirable.

We think that recent insights into the evolution of PPIs, in particular the duplication-divergence hypothesis, might be crucial to this endeavor. Nature is a tinkerer, not an inventor [46]. For PPIs this means that new PPIs are primarily derived from preexisting PPIs rather than invented de novo. Consequently, most true-positive PPIs must have homologous PPIs, not only among highly similar proteins, but also among distantly related proteins. This important characteristic of biologically relevant PPIs should, in principle, allow successful discrimination between true and false PPIs.

In light of this consideration, homology-based validation techniques seem promising, but have not gained much attention so far. Literature searches revealed only five papers that proposed a homology-based technique to validate experimental PPIs on a large scale, only three of which presented a critical performance assessment. Presumably this reflects the fact that homology-based validation requires having at hand a set of PPIs among homologous proteins, when few such PPIs have been known. However, with more and more PPIs now being reported from high-throughput experiments, this limitation is no longer factor.

In this paper, we assembled a large PPI data set to reassess the performance of homology-based PPI validation. It was shown that the classic, binary validation technique is efficient on such data sets, but can be further improved by using multiple methods for homology detection and more remote homologs to complement close homologs.

We expect the findings to be most relevant in situations where interactions among assured paralogs or orthologs are not at hand and thus traditional homology-based validation is not an option. Existing PPI databases could use the proposed method to reduce their number of false-positives without losing too many true-positives, especially within well explored model organisms. Other prospective applications include the elucidation of physically interacting proteins from known protein complexes, or the validation of in silico predicted PPIs in cases where homology was not used as a criterion for prediction in advance. Prospective improvements may involve more sophisticated methods for homology detection (e.g. Profile-HMMs), identification of PPI-mediating protein features (e.g. interacting domains) prior to homology detection to refine the selection of homologs, and an assessment of the statistical significance (P-values) of computed scores to obtain an intuitive measure of a PPI's validity.

Methods

Database Search for Homologous PPIs

The modular architecture of proteins implies that a protein has not just one distinct evolutionary trajectory, but one for each biological feature it contains [47]. Since in general PPI-mediating features (e.g. the domains) are unknown, one cannot selectively examine only trajectories of relevance. One possibility is to examine evolutionary trajectories of all protein features, but this comes with an increased risk of detecting PPIs that are not truly homologous. Also, if one wishes to include weak paralogs and orthologs in the analysis to capture gene duplication events that happened long time ago, methods for homology detection become inaccurate and produce spurious hits – another source of false homologous PPIs. To maximize both sensitivity and specificity despite these difficulties, we opt for a large-scale, sequence-based screening procedure in combination with a scoring scheme. Given an experimentally determined PPI, both FASTA [48] and PSI-BLAST [49] are used to search for homologous PPIs. FASTA supplies more reliable results for closely related proteins, while PSI-BLAST is more sensitive for remote relationships [50]. To further increase the sensitivity of the method, local sequence similarities with E-values up to 10 are considered. This produces many spurious hits, and thus traditional homology-based PPI validation techniques that simply check for the existence of a single homologous PPI become misleading (compare Figure 3). We therefore follow a similar approach as Jonsson et al. [25] and apply a scoring scheme that weighs each match according to its sequence similarity: low E-values score high, and high E-values score low. Thus high scores can result from few high-quality hits but also from numerous low-quality hits.

Homologous PPIs are searched within a subset of the Protein Interaction and Molecule Search (PRIMOS) database http://primos.fh-hagenberg.at, release BETA-2.7/2007–04 [51]. This subset consists of 135,276 redundancy-removed, physical binary PPIs between 42,288 proteins from 20 organisms, imported from six primary PPI databases [52–57] (see Additional file 1). We used both FASTA (fasta34.exe, v3.4) and PSI-BLAST (blastpgp.exe, v2.2.16) to determine homologs for all proteins of our gold standard data sets. The search space for homologous proteins was restricted to the set of 42,288 proteins with known PPIs. According to Figure 1, we considered a homologous PPI as an interaction found between a pair of homologous proteins. E-value thresholds for both programs were set to 10, the ktup parameter of FASTA was set to 1, and the number of iterations for PSI-BLAST was set to 10. All other program parameters were left default.

Gold Standard Data Sets

Four gold standard positive (GSP) data sets and one gold standard negative (GSN) data set with PPIs from Saccharomyces cerevisiae are used for performance assessment. Yeast is relatively well-studied, which allows being rather stringent in the selection of the GSP data sets. In addition, yeast has already been used numerous times in similar studies, which eases the comparison with previous results.

The MIPS GSP data set comprises 1,541 physical binary PPIs obtained from the Comprehensive Yeast Genome Database (CYGD) [57]. This database is considered as a high-quality resource for yeast PPIs and is frequently used as a gold standard reference set. PPIs reported by high-throughput experiments are excluded from this data set (see Additional file 1). The Multiple Evidence GSP data set consists of 393 PPIs reported by at least two experimental methods and in at least two different publications. As an additional criterion, only publications imported from one PPI database are considered. For example, if a PPI is reported from a publication contained in DIP and MINT, it will be excluded from the data set. If DIP is the only source database for this PPI, the PPI will be included. In an integrated dataset compiled from multiple source data sets this procedure reduces the risk that duplicate PPIs are regarded as homologous. The Small Scale data set consists of yeast PPIs reported by 'small-scale' experiments and contains 902 PPIs. Only PPIs of publications with up to three reported PPIs are considered. To minimize the risk of duplicate PPIs, publications imported from more than one primary PPI database are excluded (same procedure as for the Multiple Evidence GSP). The three GSP data sets overlap only to a low degree: just 8 PPIs are common to all three data sets, 25 between MIPS and Small Scale, 86 between Small Scale and Multiple Evidence, and 10 between MIPS and Multiple Evidence. The Combined GSP data set contains all PPIs from the previous three GSP data sets (2,723 PPIs in total).

The Random GSN PPI data set was generated by randomly selecting 50,000 protein pairs out of 7,058 yeast proteins (UniProt [58] release 10.0, downloaded on March 29, 2007) that were not found interacting within the PRIMOS database. A randomly selected data set is not completely free of real PPIs, but has no selection bias, for example towards protein pairs with different molecular functions [23]. The amount of real PPIs within such a randomly selected GSN data set should be generally low at about 0.25% [59].

Scoring Scheme

The score S(a, b) of a queried interaction between two proteins a and b is defined as

S (a, b) = \sum_{o \in O} \sum_{\begin{matrix} (h_{a}, h_{b}) \in \\ H_{a} (o) \times H_{b} (o) \end{matrix}} {\begin{array}{l} sim (h_{a}) sim (h_{b}) & h_{a} interacts with h_{b} \\ 0 & otherwise \end{array}

(1)

where O is the set of organisms with known experimental PPIs in the PRIMOS database, H_a(o) and H_b(o) denote the sets of proteins from organism o that are homologous to protein a and b, respectively. If there is experimental evidence for an interaction between homolog h_aand homolog h_bin the PRIMOS database, a score proportional to their sequence similarity is added to an overall sum.

Note that the computation of S(a, b) excludes homologous PPIs where the two proteins are from different organisms. Homologous PPIs from the same organism with one protein identical to one of the source PPI proteins are allowed. In this case, the E-value of the identical protein is assumed to be 0. Furthermore, if two identical homologous PPIs are found in an organism, i.e. two pairs $(h_{a_{1}}, h_{b_{1}})$ and $(h_{a_{2}}, h_{b_{2}})$ where $h_{a_{1}} = h_{b_{2}}$ and $h_{b_{1}} = h_{a_{2}}$ , then only the homologous PPI with the lower E-value is considered. The other PPI is ignored. The E-value of a homologous PPI (h_a, h_b) is defined as max(evalue(h_a), evalue(h_b)). The similarity measure sim(x) of a homologous protein x is defined as

sim (x) = {\begin{array}{l} 300 & evalue (x) = 0 \\ - \log_{10} (\frac{evalue (x)}{100}) & otherwise \end{array}

(2)

where evalue(x) is the E-value of homolog x reported by FASTA and PSI-BLAST, respectively (note that FASTA and PSI-BLAST scores are computed independently, see below). For each pair of interacting homologs, the scoring scheme basically extracts the positive exponent of the two reported E-values (-log₁₀) and multiplies these exponents to get a joint similarity measure proportional to the similarity of both homologs. The total score is then the sum over all pairs of interacting homologs. Since a maximum E-value of 10 is allowed, division by 100 ensures that the negative logarithm is positive over the full range of possible E-values. The logarithm of zero is undefined, so E-values of zero are assigned the negative logarithm of roughly the smallest reported E-value greater than zero (10^-300). This scoring scheme is similar to those proposed by Jonsson et al. [25], but uses more interpretable E-values instead of bit scores and puts more weight on the individual similarities of the two proteins (product of logarithms instead of logarithm of products). We found this weighing scheme important for rewarding high-quality hits where both homologs exhibit a high-degree of similarity, in which case the PPI in question is almost always true [23]. In addition, the score is then normalized, so that individual scores from different search strategies can be compared and combined.

S_{norm} (a, b) = \frac{S (a, b)}{S_{\max} (a, b)}

(3)

where S_max(a, b) is defined as S(a, b) with all h_aassumed as interacting with all h_b. This scales the score to values ranging from 0 (minimum score) to 1 (maximum score).

Two normalized scores are computed independently, one with the homologs identified by FASTA and one with the homologs identified by PSI-BLAST. The final score is defined as the arithmetic mean of both normalized scores:

S_{final} (a, b) = \frac{S_{norm}^{FASTA} (a, b) + S_{norm}^{PSIBLAST} (a, b)}{2}

(4)

References

Mrowka R, Patzak A, Herzel H: Is there a bias in proteome research? Genome Res 2001, 11(12):1971–1973. 10.1101/gr.206701
Article CAS PubMed Google Scholar
Legrain P, Wojcik J, Gauthier JM: Protein-protein interaction maps: a lead towards cellular functions. Trends Genet 2001, 17(6):346–352. 10.1016/S0168-9525(01)02323-X
Article CAS PubMed Google Scholar
von Mering C, Krause R, Snel B, Cornell M, Oliver SG, Fields S, Bork P: Comparative assessment of large-scale data sets of protein-protein interactions. Nature 2002, 417(6887):399–403. 10.1038/nature750
Article CAS PubMed Google Scholar
Deane CM, Salwinski L, Xenarios I, Eisenberg D: Protein interactions: two methods for assessment of the reliability of high throughput observations. Mol Cell Proteomics 2002, 1(5):349–356. 10.1074/mcp.M100037-MCP200
Article CAS PubMed Google Scholar
Sprinzak E, Sattath S, Margalit H: How reliable are experimental protein-protein interaction data? J Mol Biol 2003, 327(5):919–923. 10.1016/S0022-2836(03)00239-0
Article CAS PubMed Google Scholar
Gilchrist MA, Salter LA, Wagner A: A statistical framework for combining and interpreting proteomic datasets. Bioinformatics 2004, 20(5):689–700. 10.1093/bioinformatics/btg469
Article CAS PubMed Google Scholar
Edwards AM, Kus B, Jansen R, Greenbaum D, Greenblatt J, Gerstein M: Bridging structural biology and genomics: assessing protein interaction data with known complexes. Trends Genet 2002, 18(10):529–536. 10.1016/S0168-9525(02)02763-4
Article CAS PubMed Google Scholar
Patil A, Nakamura H: Filtering high-throughput protein-protein interaction data using a combination of genomic features. BMC Bioinformatics 2005, 6: 100. 10.1186/1471-2105-6-100
Article PubMed Central PubMed Google Scholar
Jansen R, Greenbaum D, Gerstein M: Relating whole-genome expression data with protein-protein interactions. Genome Res 2002, 12: 37–46. 10.1101/gr.205602
Article PubMed Central CAS PubMed Google Scholar
Kemmeren P, van Berkum NL, Vilo J, Bijma T, Donders R, Brazma A, Holstege FCP: Protein interaction verification and functional annotation by integrated analysis of genome-scale data. Mol Cell 2002, 9(5):1133–1143. 10.1016/S1097-2765(02)00531-2
Article CAS PubMed Google Scholar
Deng M, Sun F, Chen T: Assessment of the reliability of protein-protein interactions and protein function prediction. Pac Symp Biocomput 2003, 140–151.
Google Scholar
Tirosh I, Barkai N: Computational verification of protein-protein interactions by orthologous co-expression. BMC Bioinformatics 2005, 6: 40. 10.1186/1471-2105-6-40
Article PubMed Central PubMed Google Scholar
Saito R, Suzuki H, Hayashizaki Y: Construction of reliable protein-protein interaction networks with a new interaction generality measure. Bioinformatics 2003, 19(6):756–763. 10.1093/bioinformatics/btg070
Article CAS PubMed Google Scholar
Goldberg DS, Roth FP: Assessing experimentally derived interactions in a small world. Proc Natl Acad Sci USA 2003, 100(8):4372–4376. 10.1073/pnas.0735871100
Article PubMed Central CAS PubMed Google Scholar
Bader JS, Chaudhuri A, Rothberg JM, Chant J: Gaining confidence in high-throughput protein interaction networks. Nat Biotechnol 2004, 22: 78–85. 10.1038/nbt924
Article CAS PubMed Google Scholar
Chen J, Hsu W, Lee ML, Ng SK: Discovering reliable protein interactions from high-throughput experimental data using network topology. Artif Intell Med 2005, 35(1–2):37–47. 10.1016/j.artmed.2005.02.004
Article PubMed Google Scholar
Pei P, Zhang A: A topological measurement for weighted protein interaction network. Proc IEEE Comput Syst Bioinform Conf 2005, 268–278.
Google Scholar
Tan SH, Zhang Z, Ng SK: Automated Detection and Validation of Interaction by Co-Evolution. Nucleic Acids Res 2004, (32 Web Server):W69-W72. 10.1093/nar/gkh471
Matthews LR, Vaglio P, Reboul J, Ge H, Davis BP, Garrels J, Vincent S, Vidal M: Identification of potential interaction networks using sequence-based searches for conserved protein-protein interactions or "interologs". Genome Res 2001, 11(12):2120–2126. 10.1101/gr.205301
Article PubMed Central CAS PubMed Google Scholar
Walhout AJ, Sordella R, Lu X, Hartley JL, Temple GF, Brasch MA, Thierry-Mieg N, Vidal M: Protein interaction mapping in C. elegans using proteins involved in vulval development. Science 2000, 287(5450):116–122. 10.1126/science.287.5450.116
Article CAS PubMed Google Scholar
Remm M, Storm CE, Sonnhammer EL: Automatic clustering of orthologs and in-paralogs from pairwise species comparisons. J Mol Biol 2001, 314(5):1041–1052. 10.1006/jmbi.2000.5197
Article CAS PubMed Google Scholar
Suthram S, Shlomi T, Ruppin E, Sharan R, Ideker T: A direct comparison of protein interaction confidence assignment schemes. BMC Bioinformatics 2006, 7: 360. 10.1186/1471-2105-7-360
Article PubMed Central PubMed Google Scholar
Saeed R, Deane C: An assessment of the uses of homologous interactions. Bioinformatics 2007.
Google Scholar
Brown KR, Jurisica I: Online predicted human interaction database. Bioinformatics 2005, 21(9):2076–2082. 10.1093/bioinformatics/bti273
Article CAS PubMed Google Scholar
Jonsson PF, Cavanna T, Zicha D, Bates PA: Cluster analysis of networks generated through homology: automatic identification of important protein communities involved in cancer metastasis. BMC Bioinformatics 2006, 7: 2. 10.1186/1471-2105-7-2
Article PubMed Central PubMed Google Scholar
Zhang J: Evolution by gene duplication: an update. Trends in Ecology and Evolution 2003, 18(6):292–298. 10.1016/S0169-5347(03)00033-8
Article Google Scholar
Gough J, Karplus K, Hughey R, Chothia C: Assignment of homology to genome sequences using a library of hidden Markov models that represent all proteins of known structure. J Mol Biol 2001, 313(4):903–919. 10.1006/jmbi.2001.5080
Article CAS PubMed Google Scholar
Todd AE, Orengo CA, Thornton JM: Evolution of function in protein superfamilies, from a structural perspective. J Mol Biol 2001, 307(4):1113–1143. 10.1006/jmbi.2001.4513
Article CAS PubMed Google Scholar
Murzin AG: How far divergent evolution goes in proteins. Curr Opin Struct Biol 1998, 8(3):380–387. 10.1016/S0959-440X(98)80073-0
Article CAS PubMed Google Scholar
Ohno S: Evolution by gene duplication. Springer-Verlag; 1970.
Book Google Scholar
Levy ED, Pereira-Leal JB: Evolution and dynamics of protein interactions and networks. Curr Opin Struct Biol 2008, 18(3):349–357. 10.1016/j.sbi.2008.03.003
Article CAS PubMed Google Scholar
Evlampiev K, Isambert H: Modeling protein network evolution under genome duplication and domain shuffling. BMC Syst Biol 2007, 1: 49. 10.1186/1752-0509-1-49
Article PubMed Central PubMed Google Scholar
Vázquez A, Flammini A, Maritan A, Vespignani A: Modeling of Protein Interaction Networks. Complexus 2003, 1: 38–44. 10.1159/000067642
Article Google Scholar
Pastor-Satorras R, Smith E, Solé RV: Evolving protein interaction networks through gene duplication. J Theor Biol 2003, 222(2):199–210. 10.1016/S0022-5193(03)00028-6
Article CAS PubMed Google Scholar
Light S, Kraulis P, Elofsson A: Preferential attachment in the evolution of metabolic networks. BMC Genomics 2005, 6: 159. 10.1186/1471-2164-6-159
Article PubMed Central PubMed Google Scholar
Pereira-Leal JB, Teichmann SA: Novel specificities emerge by stepwise duplication of functional modules. Genome Res 2005, 15(4):552–559. 10.1101/gr.3102105
Article PubMed Central CAS PubMed Google Scholar
Amoutzias GD, Robertson DL, Oliver SG, Bornberg-Bauer E: Convergent evolution of gene networks by single-gene duplications in higher eukaryotes. EMBO Rep 2004, 5(3):274–279. 10.1038/sj.embor.7400096
Article PubMed Central CAS PubMed Google Scholar
Teichmann SA, Babu MM: Gene regulatory network growth by duplication. Nat Genet 2004, 36(5):492–496. 10.1038/ng1340
Article CAS PubMed Google Scholar
van Noort V, Snel B, Huynen MA: The yeast coexpression network has a small-world, scale-free architecture and can be explained by a simple model. EMBO Rep 2004, 5(3):280–284. 10.1038/sj.embor.7400090
Article PubMed Central CAS PubMed Google Scholar
Eisenberg E, Levanon EY: Preferential attachment in the protein network evolution. Phys Rev Lett 2003, 91(13):138701. 10.1103/PhysRevLett.91.138701
Article PubMed Google Scholar
Mika S, Rost B: Protein-protein interactions more conserved within species than across species. PLoS Comput Biol 2006, 2(7):e79. 10.1371/journal.pcbi.0020079
Article PubMed Central PubMed Google Scholar
Yu H, Luscombe NM, Lu HX, Zhu X, Xia Y, Han JDJ, Bertin N, Chung S, Vidal M, Gerstein M: Annotation transfer between genomes: protein-protein interologs and protein-DNA regulogs. Genome Res 2004, 14(6):1107–1118. 10.1101/gr.1774904
Article PubMed Central CAS PubMed Google Scholar
Jose H, Vadivukarasi T, Devakumar J: Extraction of protein interaction data: a comparative analysis of methods in use. EURASIP J Bioinform Syst Biol 2007, 53096.
Google Scholar
Fawcett T: ROC Graphs: Notes and Practical Considerations for Researchers. Machine Learning 2004.
Google Scholar
Hart GT, Ramani AK, Marcotte EM: How complete are current yeast and human protein-interaction networks? Genome Biol 2006, 7(11):120. 10.1186/gb-2006-7-11-120
Article PubMed Central PubMed Google Scholar
Jacob F: Evolution and tinkering. Science 1977, 196(4295):1161–1166. 10.1126/science.860134
Article CAS PubMed Google Scholar
Kim Y, Koyutürk M, Topkara U, Grama A, Subramaniam S: Inferring functional information from domain co-evolution. Bioinformatics 2006, 22: 40–49. 10.1093/bioinformatics/bti723
Article CAS PubMed Google Scholar
Pearson WR, Lipman DJ: Improved tools for biological sequence comparison. Proc Natl Acad Sci USA 1988, 85(8):2444–2448. 10.1073/pnas.85.8.2444
Article PubMed Central CAS PubMed Google Scholar
Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25(17):3389–3402. 10.1093/nar/25.17.3389
Article PubMed Central CAS PubMed Google Scholar
Park J, Karplus K, Barrett C, Hughey R, Haussler D, Hubbard T, Chothia C: Sequence comparisons using multiple sequences detect three times as many remote homologues as pairwise methods. J Mol Biol 1998, 284(4):1201–1210. 10.1006/jmbi.1998.2221
Article CAS PubMed Google Scholar
51. Straßer W, Siegl D, Önder K, Bauer J: InSilico Proteomics System: Integration and Application of Protein and Protein-Protein Interaction Data using Microsoft .NET. Journal of Integrative Bioinformatics 2006.,3(2)
Google Scholar
Bader GD, Donaldson I, Wolting C, Ouellette BF, Pawson T, Hogue CW: BIND-The Biomolecular Interaction Network Database. Nucleic Acids Res 2001, 29: 242–245. 10.1093/nar/29.1.242
Article PubMed Central CAS PubMed Google Scholar
Salwinski L, Miller CS, Smith AJ, Pettit FK, Bowie JU, Eisenberg D: The Database of Interacting Proteins: 2004 update. Nucleic Acids Res 2004, (32 Database):D449-D451. 10.1093/nar/gkh086
Peri S, Navarro JD, Amanchy R, Kristiansen TZ, Jonnalagadda CK, Surendranath V, Niranjan V, Muthusamy B, Gandhi TKB, Gronborg M, Ibarrola N, Deshpande N, Shanker K, Shivashankar HN, Rashmi BP, Ramya MA, Zhao Z, Chandrika KN, Padma N, Harsha HC, Yatish AJ, Kavitha MP, Menezes M, Choudhury DR, Suresh S, Ghosh N, Saravana R, Chandran S, Krishna S, Joy M, Anand SK, Madavan V, Joseph A, Wong GW, Schiemann WP, Constantinescu SN, Huang L, Khosravi-Far R, Steen H, Tewari M, Ghaffari S, Blobe GC, Dang CV, Garcia JGN, Pevsner J, Jensen ON, Roepstorff P, Deshpande KS, Chinnaiyan AM, Hamosh A, Chakravarti A, Pandey A: Development of human protein reference database as an initial platform for approaching systems biology in humans. Genome Res 2003, 13(10):2363–2371. 10.1101/gr.1680803
Article PubMed Central CAS PubMed Google Scholar
55. Hermjakob H, Montecchi-Palazzi L, Lewington C, Mudali S, Kerrien S, Orchard S, Vingron M, Roechert B, Roepstorff P, Valencia A, Margalit H, Armstrong J, Bairoch A, Cesareni G, Sherman D, Apweiler R: IntAct: an open source molecular interaction database. Nucleic Acids Res 2004, (32 Database):D452-D455. 10.1093/nar/gkh052
Google Scholar
56. Chatr-aryamontri A, Ceol A, Palazzi LM, Nardelli G, Schneider MV, Castagnoli L, Cesareni G: MINT: the Molecular INTeraction database. Nucleic Acids Res 2007, (35 Database):D572-D574. 10.1093/nar/gkl950
Google Scholar
57. Güldener U, Münsterkötter M, Oesterheld M, Pagel P, Ruepp A, Mewes HW, Stümpflen V: MPact: the MIPS protein interaction resource on yeast. Nucleic Acids Res 2006, (34 Database):D436-D441. 10.1093/nar/gkj003
Google Scholar
58. Consortium U: The universal protein resource (UniProt). Nucleic Acids Res 2008, (36 Database):D190-D195.
Google Scholar
Ben-Hur A, Noble WS: Choosing negative examples for the prediction of protein-protein interactions. BMC Bioinformatics 2006, 7(Suppl 1):S2. 10.1186/1471-2105-7-S1-S2
Article PubMed Central PubMed Google Scholar

Download references

Acknowledgements

This research was supported by the Austrian Research Promotion Agency (FFG) under the FHplus program and has been financed by the Austrian government (BMVIT and BMBWK) as well as by our co-financing partner Salzburger Landeskliniken GmbH. The authors thank all partners and colleagues that contributed to this project with their work, especially Wolfgang Straßer and Doris Siegl, who developed the underlying PRIMOS system. Any opinions, findings, and conclusions or recommendations in this paper are those of the authors and do not necessarily represent the views of the research sponsors.

Author information

Authors and Affiliations

Upper Austria University of Applied Sciences, Softwarepark 11, 4232, Hagenberg, Austria
Christian Frech, Michael Kommenda, Viktoria Dorfer & Thomas Kern
Paracelsus Medical Private University, Department of Dermatology, Müllner Hauptstraße 48, 5020, Salzburg, Austria
Helmut Hintner, Johann W Bauer & Kamil Önder
Department of Cell Biology, University of Salzburg, Salzburg, Austria
Kamil Önder

Authors

Christian Frech
View author publications
You can also search for this author in PubMed Google Scholar
Michael Kommenda
View author publications
You can also search for this author in PubMed Google Scholar
Viktoria Dorfer
View author publications
You can also search for this author in PubMed Google Scholar
Thomas Kern
View author publications
You can also search for this author in PubMed Google Scholar
Helmut Hintner
View author publications
You can also search for this author in PubMed Google Scholar
Johann W Bauer
View author publications
You can also search for this author in PubMed Google Scholar
Kamil Önder
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Christian Frech.

Additional information

Authors' contributions

CF conceived the method as well as its biological motivation, designed and conducted data analysis, and drafted the manuscript. MK and VD provided the data for analysis, were involved in many fruitful discussions, and revised the draft manuscript. TK supported in algorithm design and coordinated the project. HH and JB piloted the underlying PRIMOS system and contributed with biomedical knowhow. KÖ initiated the project, contributed with ideas throughout development, and revised the draft manuscript.

Electronic supplementary material

12859_2008_2751_MOESM1_ESM.pdf

Additional file 1: Supplementary information. File with supplementary information referenced in the main document. (PDF 72 KB)

Authors’ original submitted files for images

Below are the links to the authors’ original submitted files for images.

Authors’ original file for figure 1

Authors’ original file for figure 2

Authors’ original file for figure 3

Authors’ original file for figure 4

Authors’ original file for figure 5

Authors’ original file for figure 6

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and permissions

About this article

Cite this article

Frech, C., Kommenda, M., Dorfer, V. et al. Improved homology-driven computational validation of protein-protein interactions motivated by the evolutionary gene duplication and divergence hypothesis. BMC Bioinformatics 10, 21 (2009). https://doi.org/10.1186/1471-2105-10-21

Download citation

Received: 09 June 2008
Accepted: 19 January 2009
Published: 19 January 2009
DOI: https://doi.org/10.1186/1471-2105-10-21

Improved homology-driven computational validation of protein-protein interactions motivated by the evolutionary gene duplication and divergence hypothesis