Comparison of protein interaction networks reveals species conservation and divergence

Background Recent progress in high-throughput proteomics has given us a first chance to characterize protein interaction networks (PINs), but has also raised new challenges in interpreting the accumulating data. Results Motivated by the need to analyze and interpret the fast-growing data in proteomics, we propose a comparative strategy for the global analysis of PINs. We compare two PINs by combining interaction topology and sequence similarity to identify conserved network substructures (CoNSs). Using this approach we perform twenty-one pairwise comparisons among the seven recently available PINs of E.coli, H.pylori, S.cerevisiae, C.elegans, D.melanogaster, M.musculus and H.sapiens. In spite of the incompleteness of the data, PIN comparison discloses species conservation at the network level, and the identified CoNSs are also functionally conserved and involved in basic cellular functions. We investigate the yeast CoNSs and find that many of them correspond to known complexes. We also find that different species harbor many conserved interaction regions that are topologically identical, and that these regions can constitute larger interaction regions that are topologically different but similar in framework. Based on the species-to-species differences in CoNSs, we infer potential species divergence. It seems that different species organize orthologs in similar, but not necessarily identical, topologies to achieve similar or the same functions. This is largely attributable to duplication and divergence of genes and their associated interactions. Finally, as an application of CoNSs, we predict 101 protein-protein interactions (PPIs), annotate 339 new protein functions and deduce 170 pairs of orthologs. Conclusion Our results demonstrate that the cross-species comparison strategy we adopt is powerful for exploring biological problems from the perspective of networks.


Background
The activity of cellular life relies on the proper functioning of extremely complex interaction networks among numerous intracellular constituents. The analysis of the topology and dynamics of these networks within a living cell offers a new window onto the principles governing the construction, function and evolution of life [1]. The identification of protein-protein interactions (PPIs) within protein interaction networks (PINs) has been advanced by powerful high-throughput approaches, such as the two-hybrid assay [2], affinity purification [3], protein chips [4] and phage display [5], as well as by computational methods [6,7]. To date, these technologies have generated large PINs for several model organisms, such as H. pylori [8], S. cerevisiae [9,10], C. elegans [11] and D. melanogaster [12], and a large amount of data has been deposited in publicly accessible databases, including DIP [13], BIND [14] and MINT [15].
The study of molecular interaction networks presents both opportunities and challenges. The high error rate of high-throughput data requires us to improve our ability to discriminate true PPIs from false positives [16], and to collect more data to avoid false negatives. Network topology information can be used to predict protein functions [17] and to reformulate old questions from a network perspective [18,19]. Besides, studies on complex networks have uncovered unexpected nonrandom global organizational patterns, some of which also exist in PINs. One of the most significant features is the scale-free organization of PINs [11,12,20,21]. The scale-free topology is associated with resilience against component failure and environmental changes [21,22]. To address the possible mechanisms behind the development of the scale-free structure of real PINs, several models based on gene duplication and divergence have been proposed [23,24]. Signatures of hierarchical modularity have also been found in PINs [12,20], which calls for an objective definition and automatic identification of topological and functional modules [25][26][27]. In addition, the recent decomposition of PINs into motifs discloses some specific patterns of PINs at the local level [28,29].
Cross-species comparison is a powerful method that often provides insights into the laws underlying complex biological phenomena. Motivated by this, we propose an efficient computational strategy called NetAlign to enable the comparative analysis of two PINs. NetAlign searches for conserved network substructures (CoNSs) that can be paired in two PINs by integrating information on interaction topology and protein sequences. It implements a modified graph comparison algorithm and a clustering rule to accomplish pairwise comparison of PINs, and includes two processes for scoring and evaluating the identified CoNSs (figure 1). We apply the NetAlign method to the seven PINs of E. coli, H. pylori, S. cerevisiae, C. elegans, D. melanogaster, M. musculus and H. sapiens and perform twenty-one genome-scale pairwise comparisons among them (figures 2-10). We show that, beyond what is gleaned from the genome, PIN comparison not only reveals species conservation but also indicates potential species divergence at the PIN level. The identified CoNSs are known or candidate conserved complexes and can be used to predict PPIs, protein functions and orthologs.

Conservation of PINs
As seen from the twenty-one pairwise comparisons, PINs have only minor overlap (Table 1). This is attributable to the incompleteness of the data and to the differences among species. We introduce an overlap score to evaluate the overlap between any two PINs $N_Q$ and $N_T$. The overlap score is defined as $(Q_C/Q_0 + T_C/T_0)/2$, where $Q_C$ is the number of conserved PPIs in $N_Q$ derived from the comparison between $N_Q$ and $N_T$, and $Q_0$ is the number of PPIs in $N_Q$; $T_C$ and $T_0$ are their counterparts in $N_T$. This score ranges from 0 (i.e. $N_Q$ and $N_T$ do not overlap at all) to 1 (i.e. $N_Q$ and $N_T$ overlap completely). Given complete interaction data, the overlap score can quantify species conservation from the viewpoint of PINs. Even with poor data, some implications can be obtained. Provided that the observed PPIs are a random sample of the real PINs, the overlap score still reflects the conservation between PINs to some extent. It seems that closely related species have larger overlap. For instance, although the two bacterial PINs are not so large, they overlap with each other more than with some larger PINs such as that of D.melanogaster; another example is the significant overlap between the PINs of mouse and human, both of which are nearly the smallest among the seven. In addition, there is an obvious decrease in the number of identified c-CoNSs compared with the number of identified s-CoNSs, which suggests that great redundancy exists in s-CoNSs. This redundancy results from gene duplication and divergence, which create many small, local duplicated interaction topologies in PINs.

Figure 1. Schematic of pairwise network comparison in NetAlign. The comparison between two PINs is accomplished by a fast subgraph isomorphism algorithm and the resulting s-CoNSs are connected maximal common subgraphs (MCS) and exact matches of the two networks. The s-CoNSs are further merged by a clustering rule to produce c-CoNSs that allow inexact match among homologous regions of interaction in the two networks. The identified s-CoNSs and c-CoNSs are scored on the basis of their interaction topologies and evaluated by statistical significance.
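The overlap score defined in this section is simple enough to state as a short function. The following is a minimal sketch; the function name, argument names and the example counts are illustrative assumptions and are not taken from Table 1.

```python
def overlap_score(q_conserved, q_total, t_conserved, t_total):
    """Overlap score (Q_C/Q_0 + T_C/T_0) / 2 for two PINs N_Q and N_T."""
    if q_total <= 0 or t_total <= 0:
        raise ValueError("each network must contain at least one PPI")
    return (q_conserved / q_total + t_conserved / t_total) / 2.0


# Hypothetical counts, for illustration only (not values from Table 1).
example_score = overlap_score(q_conserved=30, q_total=1000,
                              t_conserved=20, t_total=400)
```

Because the two fractions are averaged, a small network that is almost fully conserved still contributes strongly to the score, which matches the observation about the small mouse and human PINs.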
What do the identified CoNSs represent? One way to answer this question is to inspect their functions. We associate proteins with their known biological functions using the Gene Ontology annotations (GO; Oct 2005 version). According to the hierarchical structure of GO, for each protein we propagate its GO annotations upwards through the GO hierarchy and retrieve all the relevant GO annotations. We define a CoNS to be functionally homogeneous if it contains at least one GO annotation that satisfies the following conditions: (1) for either of the corresponding two species, at least half of its proteins in the CoNS have this GO annotation; (2) the annotation is sufficiently specific, namely at least at GO level four from the root of the GO hierarchy. We find that more than 80 percent of the CoNSs are homogeneous, that is, CoNSs are also functionally conserved across species. Furthermore, to estimate the distribution of functions among the CoNSs derived from a pairwise PIN comparison, we consider ten functional categories concerning cellular function, selected from the top levels of the GO hierarchy. For each CoNS, the most frequent functional categories satisfying the above conditions are assigned to every protein in it. Then the functional categories assigned in all the CoNSs are pooled together and the frequency of each category is computed. We find that the most abundant functions are related to cellular metabolism and energy, and that functions involved in transport, signaling and cell cycle are also abundant (figure 11).
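The homogeneity test above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' code: the `parents` and `level` dictionaries stand in for the GO hierarchy, the term and protein names are made up, the root is taken to be level 0, and condition (1) is read as having to hold for each of the two species.

```python
def propagate(terms, parents):
    """Return the given GO terms together with all of their ancestors."""
    seen = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in seen:
            seen.add(term)
            stack.extend(parents.get(term, ()))
    return seen


def homogeneous_terms(species_a, species_b, annotations, parents, level,
                      min_level=4):
    """Terms held by at least half of the CoNS proteins of each species
    (assumed reading of condition 1) and at GO level >= min_level."""
    def frequent(proteins):
        counts = {}
        for protein in proteins:
            for term in propagate(annotations.get(protein, ()), parents):
                counts[term] = counts.get(term, 0) + 1
        return {t for t, c in counts.items() if 2 * c >= len(proteins)}

    shared = frequent(species_a) & frequent(species_b)
    return {t for t in shared if level.get(t, 0) >= min_level}


# Toy hierarchy (hypothetical): T4 -> T3 -> T2 -> T1 -> root.
parents = {"T4": ["T3"], "T3": ["T2"], "T2": ["T1"], "T1": ["root"]}
level = {"root": 0, "T1": 1, "T2": 2, "T3": 3, "T4": 4}
annotations = {"q1": ["T4"], "q2": ["T4"], "q3": [], "t1": ["T4"], "t2": ["T3"]}

terms = homogeneous_terms(["q1", "q2", "q3"], ["t1", "t2"],
                          annotations, parents, level)
is_homogeneous = bool(terms)
```

In the toy data, the ancestors of T4 are shared by even more proteins than T4 itself, but they are filtered out by the level condition, which is the purpose of requiring a sufficiently specific annotation.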

Divergence of PINs
Species divergence is usually studied in terms of genomes. However, species divergence must also be present at the level of PINs. Here, by virtue of the differences in CoNSs between species, we probe the conservation of the interaction topology of orthologs across species. Since s-CoNSs are exactly matched subnetworks, they indicate that different species harbor many locally conserved interaction regions that are topologically identical. Many s-CoNSs are almost the same except for minor differences due to matching permutations, which reflects the duplication of genes and interactions. On the other hand, many of the matched c-CoNSs of different species show that, although they have a similar framework of interaction topology, their detailed topological organizations can differ. This also arises from duplication and divergence of genes and the associated interactions. For instance, the RNA polymerase (RNAP) identified from the PIN comparison between E.coli and H.pylori (figures 2a-d and 10) shows a difference between the two bacteria in transcription. Four very similar s-CoNSs with minor matching differences constitute the corresponding c-CoNS of the RNAP. This suggests that the symmetric interaction topology of the E.coli RNAP results from a duplication event, whereas the RNAP of H.pylori lacks this duplication and serves as a prototype of this molecular machine. So it seems that homologous local regions of interaction that are topologically identical are common across species, and that these regions constitute larger interaction regions that are topologically different but similar in different species. Together with our above analysis of functional homogeneity, we conjecture that different species achieve similar or the same biological functions by organizing orthologs in a similar but not necessarily identical interaction topology. Theoretically, any species-to-species difference in c-CoNSs discloses a difference between the corresponding two species in some aspect. Currently, however, due to the incompleteness of the data, some of the identified differences may be false. But with the fast growth of data, our method offers a way to discover species differences and explore the problem of species divergence at the network level.

Figure 6. Representative c-CoNS: ATP synthase.

Figure 7. Representative c-CoNS: synaptosomal neurotransmitter release.

CoNSs vs. complexes
During the analysis of the identified CoNSs, another question concerns us: to what extent do the CoNSs overlap conserved complexes or pathways? To give a rough estimate, we use the MIPS yeast complex repertoire as a reference to evaluate the identified yeast c-CoNSs derived from the six pairwise PIN comparisons between yeast and the other species. Only those MIPS complexes that are manually annotated independently from the DIP data are considered, that is, we exclude all the complexes in MIPS category 550 that are based on high-throughput experiments. We compare the c-CoNSs with the reference complexes, and if the proportion of the intersecting proteins between a yeast c-CoNS and a MIPS complex exceeds a threshold, the c-CoNS is accepted as a hit. Under the 80% overlap threshold, 70 hits concerning 61 c-CoNSs are found, which accounts for about 35% of the 172 yeast c-CoNSs (Table 2).

Figure 8. Representative c-CoNS: system of fibroblast growth factors (FGFs) and FGF receptors.
We find that some c-CoNSs correspond to whole complexes, some are parts of a certain complex and some overlap several different complexes. For instance, c-CoNS 1 from S.cerevisiae vs. C.elegans completely overlaps MIPS complex 410.40.30, the DNA replication factor C that consists of five subunits RFC1, RFC2, RFC3, RFC4 and RFC5 (this complex is also identified from the comparisons of S.cerevisiae with D.melanogaster and H.sapiens); c-CoNS 26 and c-CoNS 58 from S.cerevisiae vs. D.melanogaster together compose the entire MIPS complex 500.10.30, the translation initiation factor (eIF): the former contains three subunits GCD7, GCN3 and GCD2, and the latter includes the remaining two subunits GCD6 and GCD1; part of c-CoNS 2 from S.cerevisiae vs. M.musculus overlaps four proteins STE7, KSS1, STE11 and FUS3 out of the five proteins of MIPS complex 470.20, a complex involved in the activation of MAP kinase (MAPK) in the Ras pathway. These examples demonstrate the validity of cross-species comparison for identifying conserved functional modules in PINs, and the non-hit c-CoNSs may be candidate complexes or pathways for experimental validation.
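The hit criterion used in this comparison can be sketched as a set-overlap test. The text does not state which set the proportion is taken over, so the sketch below assumes it is the smaller of the two protein sets and treats the threshold as inclusive; both choices, like the function name and the placeholder proteins X1 and X2, are assumptions made for illustration.

```python
def is_hit(cons_proteins, complex_proteins, threshold=0.8):
    """True if the shared proteins cover at least `threshold` of the smaller set.

    Assumption: the overlap proportion is relative to the smaller of the
    two sets; the paper does not specify the denominator.
    """
    cons, cplx = set(cons_proteins), set(complex_proteins)
    if not cons or not cplx:
        return False
    shared = len(cons & cplx)
    return shared / min(len(cons), len(cplx)) >= threshold


rfc_complex = {"RFC1", "RFC2", "RFC3", "RFC4", "RFC5"}
eif_complex = {"GCD7", "GCN3", "GCD2", "GCD6", "GCD1"}

whole_complex_hit = is_hit(rfc_complex, rfc_complex)
partial_complex_hit = is_hit({"GCD7", "GCN3", "GCD2"}, eif_complex)
# X1 and X2 are placeholder names for proteins outside the complex.
unrelated = is_hit({"RFC1", "X1", "X2"}, rfc_complex)
```

Under this reading, a c-CoNS that is entirely contained in a larger complex counts as a hit, which agrees with the eIF example where two c-CoNSs together make up one complex.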

Prediction of PPIs
Based on the cross-species conservation of CoNSs, there are two ways to make use of the conserved PPIs in the identified CoNSs (Table 3). The first is rather simple. A conserved PPI observed in two species is probably also present in a third species, especially when the three species belong to the same evolutionary branch. Likewise, a conserved PPI observed in more species is more likely to appear in other species. In total, we collect 1178 conserved PPIs (additional file 1). These PPIs are useful references for checking newly observed PPIs and can be transferred to other species. The second is also intuitive. Due to the conservation of CoNSs, discrepant PPIs (see red or green edges in figure 2) formed by conserved proteins in CoNSs can be used to predict new PPIs.

On the whole, our method is similar to the prediction of PPIs from interologs, which are defined as orthologous pairs of interacting proteins in different organisms [32]. However, the two methods differ in how they determine whether a PPI can be transferred. The interolog method transfers a PPI between species on the basis of the joint sequence similarity of the corresponding two pairs of interacting proteins, while our method transfers a PPI based on the conservation of local interaction topology between species. The current interolog database includes predicted PPIs for C.elegans and D.melanogaster. We compare our predictions with them and find that our single prediction for C.elegans is in the database, but our fourteen predictions for D.melanogaster are not. It is natural that the two methods can intersect, since the conservation of sequences and the conservation of interactions are sometimes consistent. However, a PPI discarded by the interolog method may still be supported by our method if it is part of a high-scoring CoNS. So, to some extent, our method complements the interolog method.

Figure 10. Representative c-CoNS: bacterial RNA polymerase (RNAP).

Prediction of protein functions
We have seen that CoNSs are functionally homogeneous and overlap significantly with known complexes. So it is natural to guess that if many proteins in a CoNS have the same function, the remaining proteins also have that function. Based on this idea, we analyze the GO annotation enrichment in c-CoNSs with a p-value < 0.001 and predict new protein-GO annotation associations whenever the following conditions are satisfied: (1) the set of proteins in a c-CoNS is significantly enriched for a particular GO annotation (p-value < 0.01); (2) the GO annotation satisfies the conditions for functional homogeneity. Then, for both species, all remaining proteins in the c-CoNS are predicted to have the enriched GO annotation.
To assess the overrepresentation of a GO term, we compute a p-value of significance by a hypergeometric test that answers the question: when sampling X proteins (the set of c-CoNS proteins) out of Y proteins (the set of proteins of the species), what is the probability that x or more of the X proteins belong to a GO functional category shared by y of the Y proteins? To control the false-positive rate, the p-value is further Bonferroni-corrected for multiple testing. The analysis of eukaryotic c-CoNSs gives 339 predictions of protein-GO annotation associations (additional file 3).
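The hypergeometric question posed above has a closed-form answer as an upper-tail sum. The sketch below computes it with exact integer binomials from the Python standard library; the function names and the toy numbers are illustrative assumptions, and the number of tests passed to the Bonferroni correction is left to the caller because the paper does not state it.

```python
from math import comb


def hypergeom_pvalue(x, X, y, Y):
    """P(at least x of the X sampled proteins carry a term held by y of Y).

    X: proteins in the c-CoNS, Y: proteins of the species,
    y: proteins of the species with the GO term, x: those among the X.
    """
    upper = min(X, y)
    total = comb(Y, X)
    tail = sum(comb(y, k) * comb(Y - y, X - k) for k in range(x, upper + 1))
    return tail / total


def bonferroni(p_value, n_tests):
    """Bonferroni-corrected p-value, capped at 1."""
    return min(1.0, p_value * n_tests)


# Toy numbers (hypothetical): 2 of 2 sampled proteins carry a term
# that 3 of 10 proteins in the species carry.
p_raw = hypergeom_pvalue(x=2, X=2, y=3, Y=10)
p_corrected = bonferroni(p_raw, n_tests=5)
```

Using exact integer arithmetic avoids the underflow that can occur when the tail is assembled from floating-point binomial terms for large proteomes.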

Discovery of orthologs
Orthologs are proteins in different species that evolved from a common ancestor by speciation, and they are often regarded as having the same or similar biological functions. An important aspect of protein function is the physical interaction of proteins with other molecules, in particular with other proteins. Based on the concept that similarity in interaction topology may indicate similarity in function, and thus orthology, we deduce orthologs. In our prediction, we only consider s-CoNSs with a p-value < 0.001 and containing at least three conserved PPIs as acceptable orthologous local interaction regions, and take paired proteins as potential orthologs. Finally, we predict 170 pairs of orthologs that are not reciprocal best BLAST hits (additional file 4). We then compare our predictions with the Inparanoid database, which collects pairwise ortholog groups of eukaryotes [33], and find that 23 of our 159 predictions on eukaryotes are present in it. To some degree, this result reflects the validity of our method. By combining the conservation of interaction topology and of sequences, our method can recover some true orthologs missed by traditional methods.

Discussion
A related method that performs pairwise network alignment between species is the PathBLAST method [34][35][36], which offers a general solution to the problem of PIN comparison. This method searches for small, linear, high-scoring seed alignments and aggregates them by dynamic programming. PathBLAST's decomposition of the problem into sub-problems is expensive in time, although each sub-problem can be solved in linear time. This limits its online application, so that the PathBLAST server restricts a query to a small-scale linear topology (with no more than 5 proteins and 4 PPIs) and focuses on the identification of conserved protein interaction paths. Here, we take a completely different approach. The core of our NetAlign method is subgraph isomorphism, which in our case is the identification of connected maximal common subgraphs (MCSs) of two PINs, and the subsequent clustering. In principle, subgraph isomorphism is NP-hard and cannot be solved for arbitrarily large networks. However, the actual constraints on PIN comparison, such as the limited sizes of PINs and ortholog correspondence, confine the solution space of the problem. In addition, the time-consuming and repetitious operations in searching for disconnected MCSs are avoided, which greatly reduces the recursion tree during the search. All of these make the solution of genome-scale PIN comparison feasible and efficient. The server supported by the NetAlign strategy can accept an arbitrarily connected query PIN and searches a target PIN for CoNSs with arbitrary topological organization [37]. These features widen its application. The resulting s-CoNSs and c-CoNSs give us different information on PINs, as shown above. The PathBLAST method allows gaps and mismatches in the alignments, while ours does not. Considering the relatively poor quality of current data, we concern ourselves with more conserved local interaction topology and aim to identify conserved interaction regions with high confidence. Our method circumvents the related fuzzy-matching problem by clustering; the discrepant PPIs reported are in effect gaps, but unlike in PathBLAST they do not participate in the solving procedure.
On the whole, NetAlign and PathBLAST are different solutions to the same problem. By virtue of their different design philosophy and principle, they have different advantages.
It is well known that high-throughput data suffer from errors, both false positives and false negatives. However, our comparative strategy is not sensitive to this kind of noise. As described in the Methods section, the identified CoNSs are filtered according to the statistical significance of their scores. This process favors CoNSs with a non-random configuration and size, and effectively decreases the impact of random errors. Here, we give a simple estimate of the impact of false positives. Suppose the p-value cutoff of the statistical filter is $p$, and the fractions of false positives in the two compared PINs are $q^+$ and $t^+$, respectively. Two cases lead to errors: two false positives match each other, or a false positive matches a true positive. Their probabilities are $q^+ t^+$ and $q^+(1-t^+) + (1-q^+)t^+$, respectively, which sum to $q^+ + t^+ - q^+ t^+$. Taken together, $p\,(q^+ + t^+ - q^+ t^+)^n$ gives the probability that a CoNS with $n$ false conserved edges occurs in the result. In our analysis, only those CoNSs with a p-value < 0.05 are taken into account, that is $p = 0.05$; according to a recent estimate [16], $q^+ \approx 0.5$ and $t^+ \approx 0.5$; so the probability that a wrong conserved edge occurs is $0.05 \times 0.75 \approx 0.038$, that is, less than 4 percent. Considering the rapid decay of the error probability with $n$, our method is reliable even under a high fraction of false positives. As for false negatives, discrepant PPIs in CoNSs are shown as colored edges, which facilitates their identification and thus reduces their impact. As a vivid demonstration, we perform six additional pairwise comparisons between a larger S.  4) and the complex of replication factor C (RFC) is also discovered in the pairwise comparisons among S.cerevisiae, C.elegans, D.melanogaster and H.sapiens (figure 3). These essential molecular machines are highly conserved across species. The n-way extension of the current method will shed light on these conserved interaction topologies and give a more reliable, conservation-based evaluation of PPIs.

Table legend: The first row shows the number of conserved PPIs of each species derived from the identified CoNSs. The second row shows the number of new PPIs predicted on the basis of discrepant PPIs formed by conserved proteins in CoNSs.
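The false-positive estimate given in this section, the p-value cutoff multiplied by the per-edge error probability raised to the number of false conserved edges, can be checked numerically. The sketch below is only a restatement of that formula; the function and argument names are illustrative assumptions.

```python
def false_cons_probability(p, q_fp, t_fp, n=1):
    """Probability that a CoNS with n false conserved edges passes the filter.

    p: p-value cutoff of the statistical filter,
    q_fp, t_fp: fractions of false positives in the two PINs.
    """
    per_edge = q_fp + t_fp - q_fp * t_fp  # either matched edge is false
    return p * per_edge ** n


# Values quoted in the text: p = 0.05, false-positive fractions of about 0.5.
probabilities = [false_cons_probability(0.05, 0.5, 0.5, n) for n in (1, 2, 3)]
```

With these values the per-edge factor is 0.75, so the estimate for a single false conserved edge is about 0.0375 and shrinks by a quarter with each additional false edge.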

Conclusion
We propose a computational strategy to perform genome-scale comparative analysis of PINs and apply this approach to the seven largest PINs currently available. In spite of the incompleteness of the data, PIN comparison enables us to identify species conservation and divergence at the network level. We find that the identified CoNSs are conserved not only in topology, but also in function. A detailed investigation of the yeast CoNSs shows that many of them correspond to complexes. Besides, based on the species-to-species differences in CoNSs, we infer potential species divergence. We find that different species harbor many conserved interaction regions that are topologically identical, and that these regions can constitute larger interaction regions that are topologically different but similar in framework. So it seems that different species organize orthologs in similar, but not necessarily identical, topologies to achieve similar or the same functions. To exemplify the application of the identified CoNSs, we reformulate the problems of PPI prediction, function annotation and ortholog assignment from a network perspective. Our results demonstrate that the cross-species comparison strategy we adopt is powerful for exploring biological problems in PINs.

Methods
We develop an efficient computational procedure called NetAlign for the comparison of two PINs. NetAlign searches for CoNSs that can be paired in two PINs by integrating information on interaction topology and protein sequences. It implements a modified graph comparison algorithm and a clustering rule to accomplish pairwise comparison of PINs, and includes two processes for scoring and evaluating the identified CoNSs (figure 1).

Graph model of PINs
In NetAlign, we model a PIN as a labeled, undirected graph $N(P, I)$, where $P$ is a set of vertices representing proteins and $I$ is a set of edges representing PPIs. To compare two PINs $N_Q(P_Q, I_Q)$ and $N_T(P_T, I_T)$ from different species, it is necessary to identify the correspondences between their vertices and edges. A vertex $A_Q$ in $N_Q$ corresponds to a vertex $A_T$ in $N_T$, in other words they are labeled the same, if they are putative orthologs. The ortholog relation is determined by a bidirectional BLAST search between the two species, which consists of two BLAST searches, one in each direction, both with an E-value ≤ $10^{-7}$. This removes the discrepancy in ortholog assignment that arises from a unidirectional BLAST search. A pair of conserved PPIs $A_Q$-$B_Q$ in $N_Q$ and $A_T$-$B_T$ in $N_T$ correspond to each other if $A_Q$ corresponds to $A_T$ and $B_Q$ corresponds to $B_T$ simultaneously.
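The two correspondence rules of the graph model can be sketched as follows. The input format is an assumption made for illustration: each BLAST direction is given as a dictionary mapping a (query protein, subject protein) pair to its E-value, and all protein identifiers and E-values below are made up.

```python
def putative_orthologs(hits_q_to_t, hits_t_to_q, max_evalue=1e-7):
    """Pairs (q, t) that hit each other in both BLAST directions under the cutoff."""
    forward = {(q, t) for (q, t), e in hits_q_to_t.items() if e <= max_evalue}
    backward = {(q, t) for (t, q), e in hits_t_to_q.items() if e <= max_evalue}
    return forward & backward


def conserved_ppis(edges_q, edges_t, orthologs):
    """Pairs of edges whose endpoints correspond simultaneously."""
    pairs = set()
    for a_q, b_q in edges_q:
        for a_t, b_t in edges_t:
            straight = (a_q, a_t) in orthologs and (b_q, b_t) in orthologs
            crossed = (a_q, b_t) in orthologs and (b_q, a_t) in orthologs
            if straight or crossed:  # edges are undirected
                pairs.add((frozenset((a_q, b_q)), frozenset((a_t, b_t))))
    return pairs


# Hypothetical BLAST results and interactions.
hits_q_to_t = {("qA", "tA"): 1e-30, ("qB", "tB"): 1e-12, ("qC", "tC"): 1e-3}
hits_t_to_q = {("tA", "qA"): 1e-25, ("tB", "qB"): 1e-9, ("tC", "qC"): 1e-20}
orthologs = putative_orthologs(hits_q_to_t, hits_t_to_q)

edges_q = [("qA", "qB"), ("qB", "qC")]
edges_t = [("tB", "tA"), ("tB", "tC")]
conserved = conserved_ppis(edges_q, edges_t, orthologs)
```

In the toy data the pair qC/tC passes the cutoff in only one direction, so it is not labeled as orthologous and the interaction qB-qC has no counterpart.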

Network comparison
The aim of NetAlign is to identify CoNSs, which may derive from a common ancestor, in two PINs. The identification of CoNSs is naturally formulated as subgraph isomorphism, which is a well-known NP-hard problem. To be exact, we treat network comparison as enumerating all the maximal common subgraphs (MCSs) of two networks. To avoid meaningless, repetitious combinations of components in disconnected MCSs, we only take connected MCSs into account and define them as s-CoNSs (single CoNSs; see figure 2 for examples). This greatly reduces the search space of the problem.
To solve the MCS problem of two networks $N_Q(P_Q, I_Q)$ and $N_T(P_T, I_T)$, an edge compatibility graph $G = (V, E)$ is built. Here, $V$ is the set of corresponding edge pairs, defined as $V = \{(i_{Qm}, i_{Tn}) \mid i_{Qm} \in I_Q,\ i_{Tn} \in I_T,\ i_{Qm} \text{ corresponds to } i_{Tn}\}$. $E$ connects two edge pairs $v_h = (i_{Qa}, i_{Ta})$ and $v_k = (i_{Qb}, i_{Tb})$, where $i_{Qa}, i_{Qb} \in I_Q$ and $i_{Ta}, i_{Tb} \in I_T$. A pair $(v_h, v_k)$ with $v_h, v_k \in V$ belongs to $E$ if $i_{Qa} \neq i_{Qb}$ and $i_{Ta} \neq i_{Tb}$, and if either $i_{Qa}$, $i_{Qb}$ in $N_Q$ are connected via a vertex corresponding to the vertex shared by $i_{Ta}$, $i_{Tb}$ in $N_T$, or $i_{Qa}$, $i_{Qb}$ and $i_{Ta}$, $i_{Tb}$ are not adjacent in $N_Q$ and $N_T$, respectively. Each maximal complete subgraph of $G$ is an MCS between $N_Q$ and $N_T$. The problem is thus transformed into the all-maximal-cliques problem, which requires enumerating all the maximal complete subgraphs. The Bron-Kerbosch algorithm is a fast and widely used algorithm for this [30]. Here we implement a variant of this algorithm, which detects all cliques representing connected MCSs.
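The reduction from common subgraphs to cliques can be illustrated on a toy example. The sketch below builds the edge compatibility graph and enumerates maximal cliques with the textbook Bron-Kerbosch recursion with pivoting; it is not the authors' variant, since it does not restrict the search to connected MCSs. Each vertex of the compatibility graph is stored as a pair of protein correspondences, self-interactions are assumed absent, and all names and data are hypothetical.

```python
from itertools import combinations


def edge_pairs(edges_q, edges_t, orthologs):
    """Set V: corresponding edge pairs, stored as two (q, t) vertex matches."""
    pairs = set()
    for a_q, b_q in edges_q:
        for a_t, b_t in edges_t:
            for x_t, y_t in ((a_t, b_t), (b_t, a_t)):
                if (a_q, x_t) in orthologs and (b_q, y_t) in orthologs:
                    pairs.add(((a_q, x_t), (b_q, y_t)))
    return pairs


def compatible(v1, v2):
    """Edge rule of E: distinct edges that are adjacent consistently or not at all."""
    m1, m2 = dict(v1), dict(v2)
    q1, q2 = set(m1), set(m2)
    t1, t2 = set(m1.values()), set(m2.values())
    if q1 == q2 or t1 == t2:            # same edge in either network
        return False
    shared_q, shared_t = q1 & q2, t1 & t2
    if not shared_q and not shared_t:   # non-adjacent in both networks
        return True
    if len(shared_q) == 1 and len(shared_t) == 1:
        s_q, s_t = next(iter(shared_q)), next(iter(shared_t))
        return m1[s_q] == s_t and m2[s_q] == s_t
    return False


def compatibility_graph(vertices):
    adj = {v: set() for v in vertices}
    for v1, v2 in combinations(vertices, 2):
        if compatible(v1, v2):
            adj[v1].add(v2)
            adj[v2].add(v1)
    return adj


def maximal_cliques(adj):
    """Bron-Kerbosch with pivoting; returns all maximal cliques."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(r)
            return
        pivot = max(p | x, key=lambda u: len(adj[u] & p))
        for u in list(p - adj[pivot]):
            expand(r | {u}, p & adj[u], x & adj[u])
            p = p - {u}
            x = x | {u}

    expand(frozenset(), set(adj), set())
    return cliques


# Toy networks: path qA-qB-qC versus path tA-tB-tC with an extra edge tC-tD.
orthologs = {("qA", "tA"), ("qB", "tB"), ("qC", "tC")}
edges_q = [("qA", "qB"), ("qB", "qC")]
edges_t = [("tA", "tB"), ("tB", "tC"), ("tC", "tD")]

vertices = edge_pairs(edges_q, edges_t, orthologs)
cliques = maximal_cliques(compatibility_graph(vertices))
```

In the toy example the two corresponding edge pairs share the matched vertex qB/tB, so they are compatible and form a single clique, which represents the conserved two-edge path.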

Clustering CoNSs
Each identified s-CoNS is a solution of the network comparison and is an exact match between two subnetworks in the two PINs. However, redundancy exists in regions of interaction where paralogs interact, and s-CoNSs can overlap each other. Besides, there may be inexact matches between the conserved interaction regions in the two PINs due to loss, duplication and divergence of genes and their associated interactions, or due to data incompleteness; these regions can also be disconnected. To handle these cases, we introduce c-CoNSs (clustered CoNSs; see figures 3-10 for examples) by merging similar s-CoNSs. Two s-CoNSs are clustered if their number of intersecting vertices is equal to or greater than 80% of the smaller one for either of the two species. Three or more s-CoNSs are clustered by the rule of single linkage, that is, the clustering relation is transitive. If an s-CoNS cannot be clustered with others, it forms a c-CoNS by itself.
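The clustering rule can be sketched with a union-find structure, which gives single linkage directly. Each s-CoNS is represented here as a pair of vertex sets, one per species. The phrase "for either of the two species" is read as requiring the 80% condition in both species; that reading, the function names and the toy data are assumptions.

```python
def similar(a, b, threshold=0.8):
    """a, b: (query_vertices, target_vertices) of two s-CoNSs.

    Assumption: the overlap condition must hold in both species.
    """
    for side_a, side_b in zip(a, b):
        smaller = min(len(side_a), len(side_b))
        if len(side_a & side_b) < threshold * smaller:
            return False
    return True


def cluster(s_conss, threshold=0.8):
    """Single-linkage clustering; returns lists of s-CoNS indices (c-CoNSs)."""
    parent = list(range(len(s_conss)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(s_conss)):
        for j in range(i + 1, len(s_conss)):
            if similar(s_conss[i], s_conss[j], threshold):
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(s_conss)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical s-CoNSs: 1 and 2 are both similar to 0 but not to each other.
s_conss = [
    ({"q1", "q2", "q3"}, {"t1", "t2", "t3"}),
    ({"q1", "q2"}, {"t1", "t2"}),
    ({"q2", "q3"}, {"t2", "t3"}),
    ({"q7", "q8"}, {"t7", "t8"}),
]
c_conss = cluster(s_conss)
```

The toy data show the transitivity of the rule: the second and third s-CoNSs end up in the same c-CoNS only because each is similar to the first, and the unrelated fourth one forms a c-CoNS by itself.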

Scoring strategy
A CoNS is scored based on its size, i.e. the number of conserved PPIs it has, and its connectivity. Each connected component of a CoNS is considered independently and scored as n(n+1)/2, where n is the number of conserved PPIs in it. The ultimate score of the CoNS is the sum of these individual scores. This simple strategy gives higher scores to CoNSs with larger size and better connectivity, since they are more likely to occur not by chance but by conservation in evolution.
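The scoring rule translates directly into code: group the conserved PPIs of a CoNS by connected component and sum n(n+1)/2 over the components. The sketch below is one way to do this with union-find; the function name and the example edges are illustrative assumptions.

```python
def cons_score(edges):
    """Sum of n(n+1)/2 over connected components, n = conserved PPIs per component."""
    parent = {}

    def find(v):
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for a, b in edges:
        parent[find(a)] = find(b)

    edges_per_component = {}
    for a, _ in edges:
        root = find(a)
        edges_per_component[root] = edges_per_component.get(root, 0) + 1

    return sum(n * (n + 1) // 2 for n in edges_per_component.values())


connected = cons_score([("a", "b"), ("b", "c"), ("c", "d")])
fragmented = cons_score([("a", "b"), ("b", "c"), ("x", "y")])
```

Three conserved PPIs in one component score 6, whereas the same number split into components of two and one score 3 + 1 = 4, which shows how the quadratic term rewards connectivity.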

Statistical evaluation
To evaluate the statistical significance of an identified CoNS, we compute a p-value based on the distribution of top scores obtained by applying the above method to randomized data. A PIN is randomized by randomly shuffling the labels associated with the vertices and rewiring the edges while preserving the degree of each vertex. We perform 1000 rounds of comparisons between the randomized versions of the two PINs and estimate the p-value of a CoNS as the fraction of runs that result in a CoNS with the same or a greater score. All the CoNSs taken into account in the subsequent analysis have a p-value < 0.05 unless explicitly specified otherwise.
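The randomization test can be sketched in two parts: a degree-preserving rewiring step and the empirical p-value. The paper does not say how the rewiring is carried out, so the double-edge swap used below is an assumed, commonly used choice; label shuffling and the comparison of the randomized PINs are omitted, and all names and data are hypothetical.

```python
import random


def rewire(edges, n_swaps, rng):
    """Degree-preserving randomization by repeated double-edge swaps (assumed method)."""
    edges = [tuple(e) for e in edges]
    present = {frozenset(e) for e in edges}
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(edges)), 2)
        (a, b), (c, d) = edges[i], edges[j]
        new1, new2 = frozenset((a, d)), frozenset((c, b))
        if len({a, b, c, d}) < 4 or new1 in present or new2 in present:
            continue  # swap would create a self-loop or a duplicate edge
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {new1, new2}
        edges[i], edges[j] = (a, d), (c, b)
    return edges


def empirical_pvalue(observed_score, random_top_scores):
    """Fraction of randomized runs whose top score reaches the observed score."""
    hits = sum(1 for s in random_top_scores if s >= observed_score)
    return hits / len(random_top_scores)


# Toy network and made-up top scores from four randomized runs.
toy_edges = [("a", "b"), ("c", "d"), ("e", "f"), ("a", "c"), ("b", "e")]
rewired = rewire(toy_edges, n_swaps=50, rng=random.Random(0))
p_value = empirical_pvalue(10, [1, 5, 10, 12])
```

With 1000 randomized runs, as in the text, the smallest nonzero p-value this estimator can report is 0.001, which is worth keeping in mind when thresholds such as p-value < 0.001 are applied.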