Improved homology-driven computational validation of protein-protein interactions motivated by the evolutionary gene duplication and divergence hypothesis

Background: Protein-protein interaction (PPI) data sets generated by high-throughput experiments are contaminated by large numbers of erroneous PPIs. Therefore, computational methods for PPI validation are necessary to improve the quality of such data sets. Against the background of the theory that most extant PPIs arose as a consequence of gene duplication, the sensitive search for homologous PPIs, i.e. for PPIs descending from a common ancestral PPI, should be a successful strategy for PPI validation.

Results: To validate an experimentally observed PPI, we combine FASTA and PSI-BLAST to perform a sensitive sequence-based search for pairs of interacting homologous proteins within a large, integrated PPI database. A novel scoring scheme that incorporates both quality and quantity of all observed matches allows us (1) to also consider tentative paralogs and orthologs in this analysis and (2) to combine search results from more than one homology detection method. ROC curves illustrate the high efficacy of this approach and its improvement over other homology-based validation methods.

Conclusion: New PPIs are primarily derived from preexisting PPIs and not invented de novo. Thus, the hallmark of true PPIs is the existence of homologous PPIs. The sensitive search for homologous PPIs within a large body of known PPIs is an efficient strategy to separate biologically relevant PPIs from the many spurious PPIs reported by high-throughput experiments.

"... recurrent events of single-gene duplication and domain rearrangement repeatedly gave rise to distinct networks with almost identical hub-based topologies, and multiple activators and repressors. We thus provide the first empirical evidence for scale-free protein networks emerging through single-gene duplications, the dominant importance of molecular modularity in the bottom-up construction of complex biological entities, and the convergent evolution of networks." Another work focused on gene-regulatory interaction networks in E. coli and S. cerevisiae (Teichmann and Babu, 2004). From previous investigations it was already known that duplication of transcription factor genes followed by inheritance of interaction has contributed considerably to the growth of the regulatory network, with more than two-thirds of E. coli (77%) and S. cerevisiae (69%) transcription factors having at least one interaction in common with their duplicates. Teichmann and Babu found, "in both organisms, only a small fraction (10%) of the interactions evolved by innovation, consisting of transcription factors and target genes without homologs. Almost 90% of the interactions evolved by duplication of either a transcription factor or a target gene." van Noort et al. investigated the gene co-expression network in S. cerevisiae, in which genes are linked when they are coregulated (van Noort et al., 2004). They derived a simple duplication-divergence model for its evolution based on the observation that there is a positive correlation between the sequence similarity of paralogs and their probability of co-expression or sharing of transcription factor binding sites. They conclude that their model "reproduces the scale-free, small-world architecture of the coregulation network and the homology relations between coregulated genes without the need for selection either at the level of the network structure or at the level of gene regulation."
Although this study is based on data from a completely different experimental technique, again the resulting protein network could be fully explained by two simple evolutionary events: gene duplication with subsequent divergence. Pereira-Leal and Teichmann emphasized the importance of duplication with subsequent divergence in the evolution of protein complexes (Pereira-Leal and Teichmann, 2005). They observed, "at least 6%-20% of the protein complexes have strong similarity to other complexes; thus a considerable fraction has evolved by duplication." In a second work, they studied protein complexes in S. cerevisiae, complexes of known three-dimensional structure in the Protein Data Bank and clusters of pairwise protein interactions in the networks of several organisms (Pereira-Leal et al., 2007). They found, "duplication of homomeric interactions, a large class of protein interactions, frequently results in the formation of complexes of paralogous proteins. This route is a common mechanism for the evolution of complexes and clusters of protein interactions." Interestingly, Ispolatov et al. (2005) found that homodimers on average have twice as many interaction partners as non-self-interacting proteins, suggesting that most of the interactions between paralogs are actually inherited from ancestral homodimeric proteins rather than established de novo after duplication. In fact, homodimers might well have played a crucial role in PPI evolution, since they have an increased affinity for self-interaction due to structural complementarity (Lukatsky et al., 2007; Monod et al., 1965), and could thus once have provided the 'raw material' for the evolution of larger, highly complex PPI networks.
We used both FASTA (fasta34.exe, v3.4) and PSI-BLAST (blastpgp.exe, v2.2.16) to determine homologs for all proteins contained in our PPI gold standard data sets (see Gold Standard Data Sets). E-value thresholds for both programs were set to 10, the ktup parameter of FASTA was set to 1, and the number of iterations for PSI-BLAST was set to 10. All other program parameters were left at their default values. Our search resulted in 487,008 homology assignments found by FASTA and 2,797,323 homology assignments determined by PSI-BLAST.
In a final filter step, we excluded homologous PPIs from organisms with fewer than 100 PPIs. Because codon bias makes the search results questionable, homologous PPIs from Plasmodium falciparum were excluded from our analysis as well. Supplementary Table 1 lists the organisms finally considered as well as the number of PPIs and the number of interacting proteins in each organism.
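The filter step can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that each PPI is stored as an (organism, protein_a, protein_b) tuple and that organism sizes are counted within the list passed in; the function and constant names are hypothetical and do not come from the original analysis pipeline.

```python
from collections import Counter

# Values taken from the text; the names themselves are hypothetical.
MIN_PPIS_PER_ORGANISM = 100
EXCLUDED_ORGANISMS = frozenset({"Plasmodium falciparum"})


def filter_ppis(ppis, min_ppis=MIN_PPIS_PER_ORGANISM, excluded=EXCLUDED_ORGANISMS):
    """Keep PPIs from organisms with at least `min_ppis` PPIs that are not excluded.

    `ppis` is assumed to be a list of (organism, protein_a, protein_b) tuples.
    """
    # Count how many PPIs each organism contributes to the input list.
    counts = Counter(organism for organism, _, _ in ppis)
    return [
        ppi
        for ppi in ppis
        if ppi[0] not in excluded and counts[ppi[0]] >= min_ppis
    ]
```

In this sketch, the organism count is derived from the same list that is being filtered; if the counts come from a separate database, they would have to be passed in instead.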

Background: Binary Classification and Receiver Operator Characteristics (ROC)
A binary classifier is a mapping of instances to two possible outcomes. For PPI validation, these outcomes are either (1) a PPI is predicted to be true or (2) a PPI is predicted to be false. There are four possible outcomes of this classification problem:

1) A PPI is predicted to be true and it is actually true; then it is called a true-positive (TP).

2) A PPI is predicted to be true, but it is actually false; then it is called a false-positive (FP).

3) A PPI is predicted to be false and it is actually false; this is called a true-negative (TN).

4) A PPI is predicted to be false, but it is actually true; then this is a false-negative (FN).
Two important measures of a binary classifier are the True-Positive Rate (TPR) and the False-Positive Rate (FPR).
The TPR is the fraction of correctly classified positives among all positive instances, and the FPR is the fraction of the incorrectly classified negatives among all negative instances. These two measures essentially capture the information of all four possible outcomes of a binary classifier. Note that the TPR is equivalent to sensitivity, and the FPR is equal to 1 − specificity.
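As a small illustration of these definitions, the following Python sketch computes TPR and FPR from the four outcome counts. The function name and the example counts are hypothetical and serve only to make the formulas TPR = TP / (TP + FN) and FPR = FP / (FP + TN) concrete.

```python
def tpr_fpr(tp, fp, tn, fn):
    """Return (TPR, FPR) for the given confusion-matrix counts."""
    positives = tp + fn  # all actually true PPIs
    negatives = fp + tn  # all actually false PPIs
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative instance")
    tpr = tp / positives  # sensitivity
    fpr = fp / negatives  # 1 - specificity
    return tpr, fpr


# Hypothetical counts: 100 true PPIs and 100 false PPIs.
tpr, fpr = tpr_fpr(tp=80, fp=10, tn=90, fn=20)
```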
One way to illustrate the overall performance of a binary classifier is to create a Receiver Operator Characteristics (ROC) curve. In signal detection theory, a ROC curve is a graphical plot of the FPR (x-axis) vs. the TPR (y-axis). Each point in a ROC curve represents a pair of TPR and FPR values obtained by applying a different numerical threshold to the classifier output (in our case, this output is the PPI score as introduced in the Methods section). The best possible prediction method would yield a point in the upper left corner of the ROC diagram, representing 100% sensitivity (all true positives are found) and 100% specificity (all true negatives are found). A completely random guess would give a point along the diagonal line (the line of no-discrimination) from the bottom left to the top right corner. In practice, classification performance lies somewhere in between these two extremes. The closer the curve lies to the upper left corner of the diagram, the better. For a more detailed introduction to ROC curves refer to the literature, for example Fawcett (2004).
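The construction of a ROC curve from thresholded scores can be sketched as follows. This is a minimal illustration, assuming that a PPI is predicted to be true when its score is equal to or above the threshold; the function name and the toy scores are hypothetical and unrelated to the data sets used in this work.

```python
def roc_points(scores, labels):
    """Return a list of (FPR, TPR) points, one per distinct score threshold.

    `scores` are classifier scores; `labels` are True for actual positives
    and False for actual negatives. An instance is predicted positive when
    its score is >= the threshold (an assumption of this sketch).
    """
    positives = sum(1 for y in labels if y)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative instance")

    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted true
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
        points.append((fp / negatives, tp / positives))
    return points


# Toy example with two true and two false PPIs.
points = roc_points([0.9, 0.8, 0.4, 0.2], [True, False, True, False])
```

Lowering the threshold step by step moves the curve from the bottom left corner (nothing predicted true) to the top right corner (everything predicted true).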

Distribution of Scores
ROC curves as shown in Figures 4, 5, and 6 are very useful to illustrate the overall performance of a classification method but do not explicitly convey which score threshold was actually used to achieve a certain TPR and FPR. Supplementary Figure 1 addresses this question.

Figure 1. Distribution of scores within the GSP and GSN data sets. The x-axis depicts different score thresholds between 0 and 1, and the y-axis shows the percentage of GSP PPIs (white bars) and GSN PPIs (grey bars), respectively, that achieved a score equal to or above this threshold. GSP PPIs comprised all PPIs of the Combined data sets (2,723 PPIs in total). GSN PPIs were 50,000 PPIs taken from our Random data set.
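The quantity shown on the y-axis, i.e. the percentage of PPIs scoring at or above each threshold, can be computed as in the following minimal Python sketch. The function name and the example scores are hypothetical; applied separately to the GSP and GSN scores, the two resulting series would correspond to the white and grey bars, respectively.

```python
def percent_at_or_above(scores, thresholds):
    """For each threshold, return the percentage of scores >= that threshold."""
    if not scores:
        raise ValueError("scores must not be empty")
    n = len(scores)
    return [100.0 * sum(1 for s in scores if s >= t) / n for t in thresholds]


# Hypothetical scores in [0, 1] evaluated at three thresholds.
percentages = percent_at_or_above([0.0, 0.3, 0.5, 1.0], [0.0, 0.5, 1.0])
```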