Visualisation and graph-theoretic analysis of a large-scale protein structural interactome

Background
Large-scale protein interaction maps provide a new, global perspective with which to analyse protein function. PSIMAP, the Protein Structural Interactome Map, is a database of all the structurally observed interactions between superfamilies of protein domains with known three-dimensional structure in the PDB. PSIMAP incorporates both functional and evolutionary information into a single network.

Results
We present a global analysis of PSIMAP using several distinct network measures relating to centrality, interactivity, fault-tolerance, and taxonomic diversity. We found the following results:
Centrality: we show that the center and barycenter of PSIMAP do not coincide, and that the superfamilies forming the barycenter relate to very general functions, while those constituting the center relate to enzymatic activity.
Interactivity: we identify the P-loop and immunoglobulin superfamilies as the most highly interactive. We successfully use connectivity and cluster index, which characterise the connectivity of a superfamily's neighbourhood, to discover superfamilies of complex I and II. This is particularly significant as the structure of complex I is not yet solved.
Taxonomic diversity: we found that highly interactive superfamilies are in general taxonomically very diverse and are thus amongst the oldest.
Fault-tolerance: we found that the network is very robust as for the majority of superfamilies removal from the network will not break up the network.

Conclusions
Overall, we can single out the P-loop containing nucleotide triphosphate hydrolases superfamily as it is the most highly connected and has the highest taxonomic diversity. In addition, this superfamily has the highest interaction rank, is the barycenter of the network (it has the shortest average path to every other superfamily in the network), and is an articulation vertex, whose removal will disconnect the network.
More generally, we conclude that the graph-theoretic and taxonomic analysis of PSIMAP is an important step towards the understanding of protein function and could be an important tool for tracing the evolution of life at the molecular level.

One group of computational methods uses the abundant genomic sequence data, and is based on the assumption that genomic proximity and gene fusion result from a selective pressure to genetically link proteins which physically interact [14][15][16]. With the exception of conserved operons and gene fusion, however, genomic proximity is more generally indicative of indirect functional associations between proteins [17] than direct interactions between the gene products.
A second group of methods, based on the assumption that protein-protein interactions are conserved across species, was originally applied to genomic comparisons [18]. Just as common function can be inferred between homologous proteins, 'homologous interaction' can be used to infer interaction between homologues of interacting proteins. This method has been validated in a comparison between PSIMAP, which contains observed protein domain interactions in the Protein Data Bank (PDB) [19] and experimentally determined domain interactions in yeast [20]. The method has also been systematically validated at the sequence level using BLAST [21], and has been improved by the use of a statistical domain level representation of the known protein interactions [22,23]. PSIMAP, the Protein Structural Interactome Map [20], is a database of all the structurally observed interactions between protein domains of known three-dimensional structure in the PDB. It can be constructed using any reliable protein domain definition, where domains are defined as evolutionarily conserved structural and functional protein units. Here we use the domain definitions provided by SCOP (Structural Classification of Proteins) [24], which uses structural and functional homology to manually define evolutionarily distinct protein domain families and superfamilies. Alternatively, other domain definitions (such as CATH [25], FSSP [26], Pfam [27], etc.) can be used.
Domains from a multi-domain PDB entry are empirically denoted as interacting with each other if at least 5 residue pairs are within 5 Angstroms (see Figure 1). Although the data in the PDB is relatively limited in comparison to the available sequence data, it is much more comprehensive when compared to the available protein interaction data [28].
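The interaction criterion described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that each residue is given as a list of atom coordinates and that two residues are in contact when any pair of their atoms lies within the distance threshold; the source does not state which atoms are compared, and the function names and toy coordinates are hypothetical.

```python
import math

def residues_in_contact(residue_a, residue_b, distance_threshold):
    """True if any atom of residue_a is within the threshold of any atom of residue_b."""
    return any(
        math.dist(atom_a, atom_b) <= distance_threshold
        for atom_a in residue_a
        for atom_b in residue_b
    )

def count_residue_contacts(domain_a, domain_b, distance_threshold=5.0):
    """Count residue pairs (one residue from each domain) that are in contact."""
    return sum(
        1
        for residue_a in domain_a
        for residue_b in domain_b
        if residues_in_contact(residue_a, residue_b, distance_threshold)
    )

def domains_interact(domain_a, domain_b, distance_threshold=5.0, contact_threshold=5):
    """Apply the rule: at least contact_threshold residue pairs within the distance."""
    contacts = count_residue_contacts(domain_a, domain_b, distance_threshold)
    return contacts >= contact_threshold

# Toy data (invented): six single-atom residues per domain, facing each other 4 Angstroms apart.
domain_a = [[(10.0 * i, 0.0, 0.0)] for i in range(6)]
domain_b = [[(10.0 * i, 4.0, 0.0)] for i in range(6)]

print(count_residue_contacts(domain_a, domain_b))   # 6
print(domains_interact(domain_a, domain_b))         # True
print(domains_interact(domain_a, domain_b[:4]))     # False: only 4 contacts
```

Both thresholds are parameters, so the same functions could be reused to explore other contact and distance thresholds.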
PSIMAP provides an overview of all the observed domain-domain interactions at the superfamily level. Considering interactions at this level is important with respect to the stability of the network; while the number of PDB entries is growing superlinearly, the number of new folds is only increasing linearly (see Figure 2). It is probable that there are no more than 2,000 distinct protein topologies in nature [29][30][31][32][33]. Because of the slow growth in the number of new superfamilies and superfamily interactions over time (data not shown), PSIMAP represents the first global overview of interactions at this level. For example, the recent conservative superfamily assignment of 56 genomes covered between 40-67% of the total detected genes in eukaryotes and eubacteria (~100,000 genes) and between 31-54% of the total detected genes in archaebacteria (~10,000 genes) [34]. As a significant portion of the unassigned genes may represent trans-membrane proteins not structurally determined due to experimental difficulty, it is reasonable to suggest that the PDB, and PSIMAP, cover many of the existing globular superfamilies in nature.
By viewing interaction between superfamilies, which encompass extremely distant evolutionary relationships [24], PSIMAP represents domain interaction within a broad evolutionary context. The analysis of PSIMAP's network topology presented here necessarily incorporates this evolutionary perspective.
Using different numbers of residue-residue contacts within different distances (contact threshold and distance threshold respectively) has a striking effect on the total number of superfamily-superfamily interactions defined. An analysis of the empirical domain interaction criterion is shown in Figure 3. Above the 4 Angstrom distance threshold, different contact thresholds yield qualitatively similar results, giving a roughly linear increase in the number of superfamily-superfamily interactions observed as the contact threshold is decreased. At the 4 Angstrom distance threshold, however, the contact threshold has the biggest effect on the number of domain-domain interactions observed, giving a roughly exponential increase in the number of superfamily-superfamily interactions observed as the contact threshold is decreased. This suggests that most domain-domain interactions occur in this approximate distance range (between 4 and 5 Angstroms). Using a contact threshold of 5 is very discriminative at the 4 Angstroms distance threshold, so the "5 by 5 rule" (defined previously [20]) is a reasonably safe choice of interaction criteria. Additionally, Tsai et al [35] show that extracting domain interaction from the PDB is a robust process.
By using a structural domain definition to extract domain-domain interactions from the PDB, it is possible to assign covalently linked domains as interacting. These 'intra-interactions' are in the minority, accounting for approximately 30% of the 20370 domain-domain interactions observed. For a breakdown of the 1232 observed superfamily-superfamily interactions (generated using SCOP version 1.61) see table 1. The validity of assigning superfamily interaction solely on the basis of observed intra-domain (covalently linked) interaction is well supported. Domain fusion has been successfully used to predict protein interaction from sequence information alone [17] and as a hypothesis for the evolution of homo [36] and hetero [14] dimers. In addition, it has been observed that intra-domain interfaces have strong similarities to inter-domain interfaces within multi-domain proteins [35,37,38]. Finally, such multi-domain proteins can be identified as independent, interacting domains in ancestral genomes [14].
To check for potential sampling error in the PDB, we checked if the absolute number of domains in a superfamily is correlated to the number of observed interactions that that superfamily makes. We did not find significant evidence for this correlation: omitting four outliers, the correlation coefficient between the number of interactions and number of domains in a superfamily is only 0.16. This suggests that a superfamily's interactivity is independent of its occurrence in the PDB.
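A check of this kind can be sketched as follows. The sketch assumes a Pearson correlation coefficient (the source does not name the coefficient used) and takes the outliers as a given set, since the source does not say how they were chosen; all superfamily labels and counts are invented placeholders.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(var_x * var_y)

# Hypothetical counts: superfamily -> (number of domains, number of interactions).
counts = {
    "sf_A": (120, 3),
    "sf_B": (15, 7),
    "sf_C": (40, 2),
    "sf_D": (8, 5),
    "sf_E": (900, 46),
}
outliers = {"sf_E"}  # assumed to be identified beforehand

kept = [pair for name, pair in counts.items() if name not in outliers]
domains = [d for d, _ in kept]
interactions = [i for _, i in kept]
r = pearson(domains, interactions)
print(round(r, 2))
```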
Visualizing structurally observed protein domain interaction at the superfamily level gives a very robust network, which combines a broad evolutionary perspective of protein interaction with the conserved structural and functional features of protein domains. PSIMAP, therefore, represents a rich, stable overview of the protein interactome.
Figure 1. Two interacting domains. Given two domains with coordinates of their residues (left), PSIMAP detects all residue pairs of the two domains within a given distance threshold (right). The two domains shown are classic TIM barrel folds from triosephosphate isomerase (7tim).

Results
Protein interaction databases such as BIND [39] and DIP [40] provide web interfaces which allow the examination of a small number of individual proteins and their interactions. They do not support the large-scale visualisation of protein interaction networks. This need has been addressed by several visualisation systems [41][42][43]. Protein interaction networks are large, however, requiring more than simple visualisation for effective data mining [12]. Consequently, there have been several global analyses of protein interaction networks (for example [44][45][46]).
Here, we employ an integrated package (PSIEYE [47]), which complements these approaches by integrating several graph-theoretic and taxonomic measures with network visualisation and exploration. Our analysis has been motivated by several questions of particular biological interest (outlined below). While these questions are relevant to the analysis of any biological network, it is important to note the additional evolutionary perspective provided by PSIMAP when analysing this network.
Figure 2. PSIMAP is based on the Protein Data Bank [19], which grows exponentially. PSIMAP is nonetheless relatively stable, as it considers interactions at superfamily level, which grows only linearly.
Figure 3. Number of superfamily-superfamily interactions observed using different residue-residue contact count and residue contact distance thresholds to analyse domain-domain contacts in the PDB. Below four Angstroms almost all superfamilies are 'isolated' from the interaction network, making very few residue-residue contacts with other superfamilies at this range. At four Angstroms, the number of interactions observed is critically dependent on the residue contact count threshold used, while at five Angstroms the contact count threshold has less effect.
1. Which superfamilies can directly or indirectly interact with each other forming subnetworks (does PSIMAP contain evolutionarily distinct interaction networks)?
2. Which superfamilies can disrupt a pathway in the network if removed (highlighting critical pathways or distinct functional contexts for superfamilies in PSIMAP)?
3. Are there multiple indirect interactions between superfamilies (making the overall superfamily interaction network topology robust, highlighting sets of superfamilies with common functional roles)?
4. How many interaction partners has a superfamily acquired over the course of evolution?
5. How well connected is the neighbourhood of a superfamily (how is a superfamily related to the rest of the network)?
6. How central is a superfamily in the network (which superfamilies make the most fundamental contribution to the overall network)?
7. Is there a core in the network (has the network grown from a core, critical set of interactions)?
8. How is the network distributed within and between taxonomic groups with respect to the above measures (how diverse is the interaction network in nature)?
The last question is particularly relevant for PSIMAP, as it is based on a reliable definition of homology applied to all the available multi-domain protein structures in the PDB. PSIMAP forms a global interaction network across many species, which can be extended using sequence-based homology searches [48]. We applied graph-theoretic and taxonomic measures to PSIMAP using the PSIEYE tool to answer the above questions.
The PSIMAP algorithm generates a large network consisting of 937 superfamilies and 538 interactions, with 512 distinct components ranging in size from the single largest component of 320 superfamilies, which will be the basis for further analysis, to some 400 isolated non-interacting single superfamilies, distributed according to a power law. These and all subsequent analyses are based on PSIMAP produced from SCOP version 1.59. To analyse PSIMAP we follow two strands: first, we look at the network topology of the map in terms of location and interactivity; second, we characterize the taxonomic diversity of the superfamilies. The former analysis can be broken down into two distinct aspects: location and interactivity.
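Splitting a network into components and selecting the largest one can be sketched with a breadth-first search. This is an illustrative sketch, not the PSIEYE implementation; the function name and the toy network are hypothetical.

```python
from collections import deque

def connected_components(vertices, edges):
    """Return the connected components of an undirected graph, largest first."""
    adjacency = {v: set() for v in vertices}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen = set()
    components = []
    for start in vertices:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:
            v = queue.popleft()
            component.append(v)
            for w in adjacency[v]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        components.append(component)
    return sorted(components, key=len, reverse=True)

# Toy network (invented): one component of three, one of two, one isolated vertex.
vertices = ["a", "b", "c", "d", "e", "f"]
edges = [("a", "b"), ("b", "c"), ("d", "e")]

components = connected_components(vertices, edges)
main_component = set(components[0])
print([len(c) for c in components])  # [3, 2, 1]
print(sorted(main_component))        # ['a', 'b', 'c']
```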

Location
Previously, it has been shown that central proteins in an interaction network are often functionally critical and their removal correlates to lethality [44]. Wuchty and Stadler define three types of centrality and apply them to metabolic and protein interaction networks [45].
We follow this approach and use two measures of network centrality, namely eccentricity and sum of distances. The eccentricity of a vertex (used to represent a superfamily in PSIMAP) is the path distance to the farthest vertex in the network. The vertices with the minimum eccentricity form the center of the network. In contrast to eccentricity, the sum of distances averages the path distance to all other vertices in the network. The barycenter is the vertex or vertices with the minimal sum of distances. Given these definitions, the center and barycenter are not necessarily the same as shown in Figure 4, where vertex A is the center, but its neighbour B is the barycenter.
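The two centrality measures defined above can be sketched on a small toy network. The sketch assumes a connected, unweighted, undirected graph; the vertex labels and edges are invented so that the center and barycenter differ, as in the situation described for Figure 4.

```python
from collections import deque

def bfs_distances(adjacency, source):
    """Shortest path length (in edges) from source to every reachable vertex."""
    distance = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adjacency[v]:
            if w not in distance:
                distance[w] = distance[v] + 1
                queue.append(w)
    return distance

def center_and_barycenter(adjacency):
    """Return (center, barycenter, eccentricity, sum_of_distances)."""
    eccentricity = {}
    sum_of_distances = {}
    for v in adjacency:
        distance = bfs_distances(adjacency, v)
        eccentricity[v] = max(distance.values())
        sum_of_distances[v] = sum(distance.values())
    min_ecc = min(eccentricity.values())
    min_sum = min(sum_of_distances.values())
    center = sorted(v for v in adjacency if eccentricity[v] == min_ecc)
    barycenter = sorted(v for v in adjacency if sum_of_distances[v] == min_sum)
    return center, barycenter, eccentricity, sum_of_distances

# Toy network (invented): B has four leaf neighbours; A links B to the path x - y.
edges = [("B", "l1"), ("B", "l2"), ("B", "l3"), ("B", "l4"),
         ("B", "A"), ("A", "x"), ("x", "y")]
adjacency = {}
for a, b in edges:
    adjacency.setdefault(a, set()).add(b)
    adjacency.setdefault(b, set()).add(a)

center, barycenter, eccentricity, sum_of_distances = center_and_barycenter(adjacency)
print(center)      # ['A']  (eccentricity 2)
print(barycenter)  # ['B']  (sum of distances 10, versus 12 for A)
```

In this toy network A minimises the distance to its farthest vertex, while B, with its many direct neighbours, minimises the total distance to all other vertices.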
In PSIMAP, the P-loop containing nucleotide triphosphate hydrolases (c.37.1) is the barycenter (see Figure 5) with the minimum sum of distances (taking 947 steps to reach the 320 superfamilies in the largest component). It is followed in the ranking by minimum sum of distances by Immunoglobulin (b.1.1); the remaining superfamilies among the six with the lowest sum of distances are Ntn hydrolases (d.153.1), ARM repeat (a.118.1), nucleotidylyl transferase (c.26.1) and the winged helix DNA-binding domain (a.4.5) (see Figure 5). These superfamilies are involved in a broad and comprehensive range of critical cellular functions, such as regulation of gene expression, cellular transport, control of the cytoskeleton, phosphorylation, nuclear division, signalling, A/GTPase activity, immunity, and carbon and nitrogen metabolism. These nearly ubiquitous and critical functions associated with superfamilies close to the barycenter reflect their critical position in the network as these superfamilies are, on average, closely associated with every other superfamily in the network's main component. By contrast, the most peripheral superfamily, with the maximum sum of distances (taking 3248 steps to reach the 320 superfamilies in the main component) is the GroEL-like chaperone, ATPase domain superfamily (a.129.1). This superfamily has a very specific function, mediating the folding and organisation of other polypeptides in order that they form the correct oligomeric structure [49].
Figure 4. Eccentricity and sum of distance. The vertex A is the center of the network, having the least distance to travel to its furthest node. Vertex B is the barycenter, having the least overall distance to every other node.
The center of the network is in the same neighbourhood as the barycenter with two superfamilies (NTN hydrolases, d.153.1 and nucleotidylyl transferase, c.26.1) in common within the six highest of both centrality measures. There are six centers in the PSIMAP network with equally small eccentricity. They are PK beta-barrel domain-like (b.58.1), Nucleotidylyl transferase (c.26.1), Ntn hydrolases (d.153.1), FMN-linked oxidoreductases (c.1.4), HPr-like (d.94.1) and Adenine nucleotide alpha hydrolases (c.26.2). As with the superfamilies which rank highly in the barycenter measurement, these superfamilies are involved in highly critical cellular functions including glycolysis; galactose/fructose metabolism and nucleotide, amino acid, lipopolysaccharide, NAD and ATP synthesis. In comparison to the barycenter superfamilies, members of the center are related to more specific enzymatic activities with an emphasis on energy metabolism and macromolecular synthesis. Conversely, members of the barycenter mediate their function via structural interactions, involving molecular switching, signalling, transport, DNA binding and protein-protein interaction. Additionally, the majority of the observed enzymatic functions in the barycenter can be attributed to the ubiquitous P-loop domain.
The slight shift in topology between the center and the barycenter in PSIMAP reflects a slight shift in the functional characteristics of the overlapping subgroups of superfamilies in the topological region. Intuitively, we hypothesize that those critical superfamilies which have general functions or a predominantly structural mode of action will have a greater number of interaction partners (which is a requirement for the lowest sum of distance in the barycenter). More specific but nonetheless critical enzymatic functions, on the other hand, will be associated with many different pathways, but may mediate indirect functional roles via common metabolites and thus need not make direct physical interactions with many different members of the network.
Figure 5. The sum of distance, the sum of all the shortest paths from a superfamily to every other superfamily in the network. Blue indicates low sum of distance (central) and red high sum of distance (rim). The 6 superfamilies with lowest sum of distance are c.37.1, P-loop containing nucleotide triphosphate hydrolases; b.1.1, Immunoglobulin; d.153.1, N-terminal nucleophile aminohydrolases (Ntn hydrolases); a.118.1, ARM repeat; c.26.1, Nucleotidylyl transferase; a.4.5, Winged helix DNA-binding domain.
Global overviews of the center and barycenter are given in figures 5 and 6. The colour-coding of these figures indicates that the majority of superfamilies have medium eccentricity, yet small sum of distances. Intuitively, the low sum of distance means that the majority of superfamilies are members of, or attached to, a well-connected core and can thus reach all other superfamilies via short average paths. Eccentricity does not take this aspect of connectivity into account and most superfamilies have medium eccentricity.

Interactivity
PSIEYE provides three measures of interactivity: connectivity, cluster index, and interaction rank. The connectivity of a vertex is simply the number of interaction partners it has. The superfamilies shown in table 2 are the 19 most interactive in PSIMAP. Figure 7 shows that the most highly connected superfamilies in PSIMAP form a single connected component. Thus, the high connectivity, core superfamilies do not break down into distinct clusters, but rather form one single, central kernel at the heart of the network.
Figure 6. A superfamily's eccentricity is the maximal distance to any other superfamily in the network.
The majority of the most highly connected superfamilies contain families of functionally important enzymes, with only three main exceptions. They are: 1) domains from the Immunoglobulin superfamily (b.1.1), frequently found as domain linkers in genomic sequences and structures, having diverse structural roles and interacting with many different proteins; 2) domains from the EF hand all-alpha superfamily (a.39.1), a structural motif (with an average size of around 40 amino acids) involved in calcium binding and the diverse regulatory functions associated with calcium; 3) the winged helix DNA-binding domain (a.4.5), which has an extremely diverse set of functions related to DNA binding. For example, the winged helix domain associates with many different small molecule binding domains to form functionally diverse families of transcription factors in prokaryotes and eukaryotes [50].
The most highly interactive superfamilies take part in a wide range of critical cellular reactions, mostly relating to energy metabolism and catabolism as well as signalling and structural roles. For example, the iron-sulfur proteins (d.58.1 and d.15.4) transfer electrons in a wide variety of metabolic reactions, indicating a very early origin in protein evolution. The PK-like superfamily (d.144.1) encompasses enzymes that belong to a very extensive family of proteins involved in almost all aspects of eukaryotic signal transduction pathways, including regulation of the cell cycle, differentiation, homeostasis and the immune response. Members of this superfamily share a conserved catalytic core common with both serine/threonine and tyrosine protein kinases [51], and have related but uncharacterised counterparts in archaea as well as functional homologues in viruses. Ubiquitin-like superfamily domains are found in an extremely broad range of protein families, having structural roles in proteolysis (including the unfolded protein response pathway), and linking cytoskeleton proteins to proteins in the plasma membrane, as well as having roles in signal transduction, including Raf-like and Ras-binding activity, guanine nucleotide exchange activity and GTP-activated phosphatidylinositol 3-kinase activity as part of the phosphatidylinositol 3-kinase complex [52]. This superfamily is also involved in DNA repair mechanisms, chromosome segregation, viral infection, splicing, autophagy and the regulation of membrane physical properties and cell development.
Indirect connectivity is useful in that it links the measures of connectivity to the location measures discussed above. Due to the particular structure of the network, the most highly interactive and central superfamilies also have higher indirect connectivity than peripheral superfamilies. Indirect connectivity shifts our perspective from a purely local view of interactivity towards a more global view.
However, connectivity and indirect connectivity do not quantify interaction density, a measure to describe the extent to which a superfamily's interaction partners interact with each other.
Cluster index [53] is a measure of interaction density, and is defined as the number of interactions between a vertex's neighbours divided by the total number of possible interactions between them. A cluster index of 0 means that none of a vertex's neighbours interact, whereas a cluster index of 1 indicates that they all interact with each other.
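The definition above can be sketched directly. The sketch assumes an undirected network stored as adjacency sets, ignores self-interactions, and returns 0 for vertices with fewer than two neighbours (a convention chosen here, not taken from the source); the toy network is invented.

```python
from itertools import combinations

def cluster_index(adjacency, vertex):
    """Fraction of possible interactions between a vertex's neighbours that exist."""
    neighbours = adjacency[vertex] - {vertex}   # ignore self-interactions
    k = len(neighbours)
    if k < 2:
        return 0.0                              # assumed convention: no neighbour pairs
    observed = sum(1 for a, b in combinations(neighbours, 2) if b in adjacency[a])
    possible = k * (k - 1) / 2
    return observed / possible

# Toy network (invented): v sits in a fully connected group; hub has sparse neighbours.
edges = [("v", "a"), ("v", "b"), ("v", "c"), ("a", "b"), ("b", "c"), ("a", "c"),
         ("hub", "p"), ("hub", "q"), ("hub", "r"), ("hub", "s"), ("p", "q")]
adjacency = {}
for a, b in edges:
    adjacency.setdefault(a, set()).add(b)
    adjacency.setdefault(b, set()).add(a)

print(cluster_index(adjacency, "v"))    # 1.0: all three neighbours interact
print(cluster_index(adjacency, "hub"))  # 1 of 6 possible neighbour pairs interact
```

The number of possible neighbour pairs, k(k-1)/2, grows quadratically with the number of interaction partners k, which is why highly connected vertices rarely reach a high cluster index.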
A high cluster index is more likely for low connectivity superfamilies, as the number of possible interactions between neighbours increases quadratically with an increasing number of interaction partners. This is highlighted by looking at the cluster index of the P-loop (c.37.1), which is the most highly interactive superfamily with 46 interaction partners, but which has a very low cluster index (0.011). In contrast, the succinate dehydrogenase/fumarate reductase catalytic domain (d.168.1) has the highest possible cluster index of 1, as all of its three interaction partners interact with each other.
Figure 7. The 19 most highly connected superfamilies form a connected component and are also highly diverse as the colour coding shows (red = high diversity).
The interaction partners of the 3 superfamilies d.168.1, a.1.2, and c.3.1 noted above overlap considerably to form a well-connected subnetwork. Analysis of the members of this subnetwork reveals that they correlate closely to various members of the mitochondrial respiratory chain. In particular, they match subunits of complex I and complex II, indicating that perhaps this subnetwork is representative of the two complexes and the interactions between their subunits. This could be highly significant as, whilst the structure of complex II has been solved, the structure of complex I has not yet been elucidated.
The respiratory chain involves a series of membrane-bound proteins that use a series of electron transfer steps to create a proton gradient across the mitochondrial membrane. This proton gradient is then used as the driving force for ATP synthesis. Complex I is the first protein complex in the respiratory chain. It is part of a redox reaction, catalysing the oxidation of NADH from the citric acid cycle along with the reduction of ubiquinone. The oxidation of NADH is coupled to electron transfer via a Flavin MonoNucleotide (FMN) prosthetic group, which acts as a first acceptor of electrons from NADH. Electron transfer is carried on through complex I by several iron-sulphur (FeS) clusters in the protein. Complex II (succinate ubiquinone oxidoreductase) is the second protein complex in the chain. It is involved in a different redox reaction, catalysing the oxidation of succinate (also a product of the citric acid cycle) to fumarate, along with the reduction of ubiquinone to ubiquinol. Succinate is oxidised by using the bound FAD on the 70 kDa subunit as an electron acceptor. As in complex I, several FeS clusters, which are found in the 27 kDa subunit, help in electron transfer through the protein.
To answer whether complex I and complex II relate to the above networks, we mapped known complex I and complex II protein subunits to their SCOP superfamilies via PSI-Blast [54]. Assigned complex I superfamilies account for the majority of the superfamilies in the smaller network (Figure 10), which shows alpha-helical ferredoxin (a.1.2) and its neighbours, and on the left half of the larger subnetwork (Figure 11), which shows the FAD/NAD(P)-binding domain and its neighbours. Complex II superfamilies account for at least 5 of the other [...]
[Figure caption: Superfamily c.3.1 has 11 interaction partners and medium cluster index.]
The above findings show that 9 out of 11 neighbours of the FAD/NAD(P)-binding domain belong to or are related to either complex I or complex II, or both. A subnetwork has been identified around this highly connected superfamily that has a comparatively high cluster index. This stresses both the importance of the superfamily and also the importance of connectivity and cluster index as a measure that is especially useful in uncovering complexes.

Interaction Rank
Both connectivity and cluster index have shortcomings: Connectivity does not consider interactions in a vertex's neighbourhood; cluster index favours low connectivity vertices. To get a better measure for the wider neighbourhood of a vertex, we have developed the idea of interaction rank, which treats interaction networks as Markov processes. In this analysis, each edge in the network is equated with a state transition in a Markov process. A similar approach has been used for the analysis of clusters in a network [55]. For example, a superfamily with a certain number of interaction partners, $p$, corresponds to a state, $v$, in the Markov process with $p$ possible successor states $w \in N(v)$ (where $N(v)$ is the set of $v$'s neighbours). A priori, each of the transitions is chosen with the same likelihood, giving a $1/|N(v)|$ chance for $v$ to 'interact' with $w \in N(v)$, where $|N(v)|$ is the size of the set. If we enumerate all vertices from $v_1$ to $v_n$, we can capture this Markov process as a transition matrix $M = (m_{ij})$, where for all $1 \le i, j \le n$, $m_{ij} = 1/|N(v_i)|$ if $v_i$ is connected to $v_j$, and $m_{ij} = 0$ otherwise. If we compute the steady state transition probabilities of this process, we can rate vertices according to this notion of 'interactivity'. We call this rating 'interaction rank'. Essentially, the more interaction partners a superfamily has, the better its interaction rank. Also, the better connected a superfamily's neighbourhood, the better the interaction rank. These two trends are intuitively a consequence of the increased probability of indirectly returning back to a superfamily via the interconnections between its interaction partners. In this way, interaction rank combines aspects of connectivity and cluster index. It does so at a global scale incorporating information about the topology of the whole network.
In this respect, interaction rank can point to the hubs of a network in terms of its overall structure, and can overcome some of the shortcomings of connectivity and cluster index.
To compute the steady state of the transition matrix, $M$, we need to find a configuration, $x$, such that $M x = \lambda x$ for a maximal real number $\lambda$. In other words, we have to compute an eigenvector $x$ for $M$ for the maximal eigenvalue $\lambda$. There are standard libraries to do this, but since we require just the eigenvector for the largest eigenvalue, we used the power method, i.e. for a random initial configuration $x_0$ we iteratively compute $M^n x_0$ for increasing $n > 0$ until $M^n x_0$ converges. The elements of the resulting eigenvector represent the steady state probabilities of the Markov process $M$ and constitute the interaction rank of the corresponding superfamilies.
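The construction of the transition matrix and the power method can be sketched as follows. Several choices here are assumptions rather than details from the source: the sketch iterates the row-vector update x ← xM, which is the usual convention for the stationary distribution of a matrix whose rows sum to 1; it starts from a uniform rather than a random configuration so that the run is reproducible; and it assumes a connected network that is not bipartite, so that the iteration converges. Function names and the toy network are hypothetical.

```python
def transition_matrix(vertices, edges):
    """Row-stochastic matrix with m[i][j] = 1/|N(v_i)| if v_i and v_j interact."""
    index = {v: i for i, v in enumerate(vertices)}
    n = len(vertices)
    neighbours = [set() for _ in range(n)]
    for a, b in edges:
        neighbours[index[a]].add(index[b])
        neighbours[index[b]].add(index[a])
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in neighbours[i]:
            matrix[i][j] = 1.0 / len(neighbours[i])
    return matrix

def interaction_rank(matrix, tolerance=1e-12, max_iterations=10000):
    """Power method: repeat x <- x M until the configuration stops changing."""
    n = len(matrix)
    x = [1.0 / n] * n                      # uniform start (assumption, see lead-in)
    for _ in range(max_iterations):
        new_x = [sum(x[i] * matrix[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(new_x, x)) < tolerance:
            return new_x
        x = new_x
    return x

# Toy network (invented): a triangle a-b-c with a pendant vertex d attached to c.
vertices = ["a", "b", "c", "d"]
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]

rank = dict(zip(vertices, interaction_rank(transition_matrix(vertices, edges))))
for vertex in sorted(rank, key=rank.get, reverse=True):
    print(vertex, round(rank[vertex], 3))
```

In this toy network, with uniform transition probabilities, the steady state probabilities are 0.375 for c, 0.25 for a and b, and 0.125 for d. Non-uniform transition probabilities can be used by changing the entries of the matrix while keeping each row summing to 1.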
Let us consider examples of superfamilies with high interaction rank. The top 25% superfamilies in PSIMAP's main component, according to interaction rank, form a connected component, and thus define the core of the whole interaction network (figure 12). While a large number of neighbours usually implies a good interaction rank, there are examples such as alpha/beta-hydrolases (c.69.1) and the galactose oxidase, central domain (b.69.1) with few interaction partners, yet high interaction rank, as they have highly scoring neighbours. The alpha/beta-hydrolases (c.69.1) superfamily has only four interaction partners (figure 12), but has a good interaction rank as its neighbourhood consists of two high ranking nodes (Galactose-binding domain-like (b.18.1) and Lipase/lipooxygenase domain (PLAT/LH2 domain) (b.12.1)) and two medium ranking nodes (Prolyl oligopeptidase, N-terminal domain (b.69.7) and HAD-like (c.108.1)). Similarly, the Galactose oxidase, central domain, b.69.1, has a medium interaction rank despite it only having two interaction partners; however, these two partners have a very high interaction rank, which is reflected in b.69.1.
To summarise, we define a transition matrix M reflecting possible interactions between superfamilies. From the transition matrix we can compute the interaction rank of each superfamily and hence complement the measures of connectivity and cluster index. In contrast to connectivity, which considers only the direct neighbourhood of a superfamily, interaction rank takes the whole network topology into account. In contrast to cluster index, which favours vertices with few interaction partners, interaction rank increases with the number of interaction partners. Furthermore, interaction rank is capable of including additional probabilistic experimental information regarding likelihood of interaction by simply updating the transition matrix accordingly. This will be a powerful basis to customize interaction rank for a researcher's specific experiments and settings.

Taxonomic Diversity
All the above measures rate vertices according to the structure of the network. Here we introduce a measure that rates superfamilies according to their taxonomic diversity. Taxonomic diversity is related to age: the more diverse a superfamily, the older it is. We have addressed the question of whether a superfamily's taxonomic diversity, and thus its age, can be related to its interactivity or location in the network. This would effectively enable us to predict age from the network structure.
To define taxonomic diversity, we used the NCBI taxonomic database [56] to count the number of species in which a superfamily's domains occur. As this species-level measure depends highly on the structure of the taxonomy (for example there are many more eukaryotes than prokaryotes), we complemented this count species-level count by also measuring the diversity at kingdom level. Kingdom-level diversity simply indicates whether a The top 25% superfamilies according to interaction rank form a highly connected component (left) Figure 12 The top 25% superfamilies according to interaction rank form a highly connected component (left). The superfamily alpha/beta-Hydrolases (c.69.1) has only four interaction partners, but has nonetheless a good interaction rank, as its neighbourhood consists of two good nodes (Galactose-binding domain-like (b.18.1) and Lipase/lipooxygenase domain (PLAT/LH2 domain) (b.12.1)) and two medium nodes (Prolyl oligopeptidase, N-terminal domain (b.69.7) and HAD-like (c.108.1)). superfamily occurs in 1, 2, 3, or 4 of the superkingdoms of archaea, bacteria, eukaryotes, and viruses. Using diversity measures, we can identify the oldest interactions and extract information about the evolution of the interaction network. The 10%, 20%, 30%, and 40% most highly diverse superfamilies in PSIMAP's main component are shown in Figure 13. Equating diversity to age, the series shows how the network developed through evolution. We further examined the core of the network: the 18 most highly diverse as shown in table 3 and their interactions as shown in Figure 14. These superfamilies can be considered the oldest, as they are the most highly diverse. It is important to note that these oldest superfamilies form one connected and (presumably ancient) component and do not break-up into different components.
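The two diversity counts described above can be sketched as follows. This is a minimal illustration under stated assumptions: the input is taken to be a list of (superfamily, species, superkingdom) records already derived from the taxonomy, and the function name `taxonomic_diversity`, the identifiers `sf_A` and `sf_B`, and the species names are hypothetical placeholders rather than PSIMAP data.

```python
from collections import defaultdict

SUPERKINGDOMS = ("archaea", "bacteria", "eukaryotes", "viruses")


def taxonomic_diversity(occurrences):
    """Return {superfamily: (species_count, superkingdom_count)}.

    occurrences: iterable of (superfamily, species, superkingdom) records,
    one per observation of a domain of that superfamily in a species.
    """
    species = defaultdict(set)
    kingdoms = defaultdict(set)
    for superfamily, sp, kingdom in occurrences:
        if kingdom not in SUPERKINGDOMS:
            raise ValueError("unknown superkingdom: " + repr(kingdom))
        species[superfamily].add(sp)      # sets ignore repeated observations
        kingdoms[superfamily].add(kingdom)
    return {sf: (len(species[sf]), len(kingdoms[sf])) for sf in species}


# Hypothetical records, for illustration only.
records = [
    ("sf_A", "species_1", "bacteria"),
    ("sf_A", "species_2", "eukaryotes"),
    ("sf_A", "species_3", "archaea"),
    ("sf_A", "species_2", "eukaryotes"),  # duplicate observation
    ("sf_B", "species_4", "eukaryotes"),
    ("sf_B", "species_5", "eukaryotes"),
]
diversity = taxonomic_diversity(records)
```

In this toy input `sf_A` is seen in three species across three superkingdoms, while `sf_B` is seen in two species within a single superkingdom, which shows why the two measures can rank superfamilies differently.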
Next, we want to relate the concept of taxonomic diversity to the other graph-theoretic measures. Can we predict the taxonomic diversity from structural properties in the network alone? At first glance, results appear to reject this: Eccentricity, sum of distance, and cluster index are correlated to neither of the diversity measures. Also, connectivity and interaction rank are only correlated with 0.25 to diversity at kingdom level. However, they show a reasonable correlation to diversity (both 0.56). Figure 15 shows this relationship in a scatter plot for connectivity. For diversity at kingdom level, both connectivity and interaction rank allow for the conclusion that superfamilies with high values occur in at least 3 superkingdom classes, while low values may or may not be spread across many kingdoms. Something similar holds in relation to diversity: Highly connected superfamilies and ones with a high interaction rank tend to occur in many species. However, superfamilies with low connectivity and interaction rank may or may not occur in many species. As a result, we can conclude that all highly interactive superfamilies are among the oldest.

Fault-tolerance, Attacks, and Convergent Evolution
It has been argued that many protein interaction networks are scale-free networks [44,57]. The scale-free property means that the vertex connectivity follows a power-law, i.e. there are few nodes that are highly connected, and many with low connectivity. This is also the case for PSIMAP. There are over 400 superfamilies that have no interaction partners and the connectivity of the most highly connected superfamilies quickly tails off as discussed above (P-loop (46), Immunoglobulin (38), (Trans)glycosidases (14), 4Fe-4S ferredoxins (12), Cytochrome c (11),...). Formally, the graph of number of interaction partners (y-axis) and superfamilies (x-axis) has a trend line of $y = 58.014\,x^{-0.7152}$, which fits very well with the squared correlation coefficient $R^2 = 0.9353$ (data not shown). This confirms the power law property for PSIMAP's largest component.
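A trend line of this form can be obtained by ordinary least squares on log-transformed data. The sketch below is one common way to do this and is not necessarily the procedure used for the figures quoted above; the function name `fit_power_law` is hypothetical, the squared correlation coefficient is computed in log-log space, and the sample data are synthetic.

```python
import math


def fit_power_law(xs, ys):
    """Fit y = a * x**b by linear regression of log(y) on log(x).

    Returns (a, b, r_squared), where r_squared is the squared correlation
    coefficient of the log-transformed data. All xs and ys must be positive.
    """
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((x - mean_x) ** 2 for x in lx)
    syy = sum((y - mean_y) ** 2 for y in ly)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lx, ly))
    b = sxy / sxx                          # slope = exponent of the power law
    a = math.exp(mean_y - b * mean_x)      # intercept = log of the prefactor
    r_squared = (sxy * sxy) / (sxx * syy)
    return a, b, r_squared


# Synthetic example: points lying exactly on y = 3 * x**-0.5.
ranks = list(range(1, 51))
values = [3.0 * r ** -0.5 for r in ranks]
a_fit, b_fit, r2_fit = fit_power_law(ranks, values)
```

For a connectivity distribution, `xs` would be the rank of each superfamily when sorted by decreasing number of interaction partners and `ys` the corresponding counts; vertices with zero partners have to be left out because their logarithm is undefined.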
Scale-free networks such as PSIMAP have special properties. On the one hand, they are very fault-tolerant, in that the removal of a random vertex is not likely to disconnect a component. They are, however, prone to attacks, in that the removal of the most highly connected vertices severely affects the network. One small component of the PSIMAP interaction network consists of a methionine synthase domain (a. 46 Proteins in set one (methionine synthase) are linked to proteins in set two (methylmalonyl-CoA mutase) via the common superfamily, c.23.6 (Cobalamin binding domain) [58]. While the link between these two sets of proteins does not represent a direct physical interaction, it highlights the evolutionary connection between the two proteins (c.23.6 physically interacts with both d.173.1 and c.1.19). The link also highlights the functional coupling of the two proteins mediated by the common cofactor, showing they are involved in related metabolic pathways and diseases [59].
Methionine synthase and methylmalonyl-CoA mutase have well described functions in higher organisms, while proteins in the third set (glycerol dehydratase) are described as bacterial. The link between methylmalonyl-CoA mutase and glycerol dehydratase is made via the common superfamily c.1.19 (cobalamin dependent enzymes). In this case we suspect that c.1.19 facilitates a true physiological interaction pathway between the superfamilies present in both species.
Figure 13 Evolution of the interaction network. PSIMAP's main component with the top 10%, 20%, 30% and 40% according to diversity.
As argued above, the pathway in Figure 16 is very dependent on both the cobalamin (vitamin B12)-binding domain and the cobalamin (vitamin B12)-dependent enzymes being present in the network, as these two vertices are so-called articulation vertices, whose removal disconnects the component. If certain superfamilies are particularly important for this reason, is there any evidence for back-up mechanisms? One way to ensure that connectivity is maintained despite the removal of vertices is to have multiple, entirely different paths connecting two superfamilies. Then the interruption of one path does not interrupt the network as a whole. In graph theory, a sub-graph in which all pairs of vertices are connected by at least two entirely different paths is called a bi-connected component.
This means that nearly all of the superfamilies in PSIMAP are not articulation vertices, i.e. removing any of these 115 superfamilies will not disconnect the largest component. Furthermore, the overview in Figure 17 highlights in pink the comparatively few articulation vertices, which connect the main bi-connected component to the rest of the network. The P-loop is such a vertex: it links the main bi-connected component to the second largest bi-connected component. Thus the P-loop is an articulation vertex, and its removal will separate these two bi-connected components.
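The notion of an articulation vertex can be made concrete with the standard depth-first-search method based on discovery times and low-links. The following minimal sketch is a generic textbook implementation, not the PSIEYE code; the function name `articulation_vertices` and the toy edge list are hypothetical, and the recursive form is only suitable for graphs small enough to stay within Python's recursion limit.

```python
from collections import defaultdict


def articulation_vertices(edges):
    """Return the set of vertices whose removal disconnects their component.

    edges: iterable of (u, v) pairs of an undirected simple graph.
    """
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    disc = {}    # DFS discovery time of each vertex
    low = {}     # earliest discovery time reachable from the DFS subtree
    cut = set()
    counter = 0

    def dfs(u, parent):
        nonlocal counter
        disc[u] = low[u] = counter
        counter += 1
        children = 0
        for v in adj[u]:
            if v not in disc:
                children += 1
                dfs(v, u)
                low[u] = min(low[u], low[v])
                # A non-root u is a cut vertex if the subtree below v
                # cannot reach any vertex discovered before u.
                if parent is not None and low[v] >= disc[u]:
                    cut.add(u)
            elif v != parent:
                low[u] = min(low[u], disc[v])   # back edge
        # The DFS root is a cut vertex iff it has more than one DFS child.
        if parent is None and children > 1:
            cut.add(u)

    for root in list(adj):
        if root not in disc:
            dfs(root, None)
    return cut


# Toy graph (hypothetical): two triangles that share the vertex "hub".
toy_edges = [
    ("a", "b"), ("b", "hub"), ("hub", "a"),
    ("hub", "c"), ("c", "d"), ("d", "hub"),
]
toy_cut = articulation_vertices(toy_edges)
```

In the toy graph each triangle is a bi-connected component, and the shared vertex is the only articulation vertex: removing any other vertex leaves the remaining vertices connected.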

The right panel of Figure 17 shows a smaller bi-connected component consisting of the four superfamilies trypsin-like serine proteases (b.47.1), CI-2 family of serine protease inhibitors (d.40.1), protease propeptides/inhibitors (d.58.3), and subtilisin-like (c.41.1). Similarly to the P-loop above, the removal of the subtilisin-like superfamily will disconnect the four superfamilies from the rest of the network reachable through the subtilisin-inhibitor (d.84.1). The bi-connected component shows that both the subtilisin-like and trypsin-like serine protease superfamilies can bind both the CI-2 serine protease inhibitors and the protease propeptides/inhibitors. This indicates that both protease superfamilies and both inhibitor superfamilies have a similar function, highlighting two instances of functionally convergent evolution [60]. Thus, the graph-theoretic analysis of bi-connected components uncovers an instance of convergent evolution.

Conclusion
PSIMAP, a map of protein interactions at superfamily level, is computed using data from PDB and SCOP, and therefore provides a structural, robust, coarse-grained view of the interactome. In this paper, we have evaluated and justified PSIMAP, we have described the development of PSIEYE, a tool for large-scale interaction network analysis and visualization, and we have used PSIEYE to analyze PSIMAP and investigate several biologically significant questions.
We have evaluated and justified PSIMAP: First, we justified a threshold of 5 amino acid contacts at less than 5 Angstrom by considering interactions over the whole parameter space. Second, we justified interaction of covalently linked domains due to the use of SCOP. Third, we justified the approach of interaction at superfamily level by showing that superfamily size and number of interaction partners are not correlated.

We have developed PSIEYE, a tool for large-scale interaction network analysis and visualization:
We have implemented a host of graph-theoretic measures such as connectivity, cluster index, eccentricity, sum of distance, and bi-connectivity to characterise proteins and their interactions in the maps. We complemented these measures with our novel approach of using interaction rank, which views interactions as a Markov process. This allowed us to rank proteins by their interactivity, effectively combining aspects of connectivity and cluster index at a global scale. We have discussed how to compute interaction rank by computing the stable state of the Markov process. The interaction rank approach also has the advantage that it can be customized by taking additional information on the possibility and probability of specific interactions into account, thus combining the large-scale structural interaction map with, e.g., experimentally determined data.
We analysed PSIMAP: We applied the graph theoretic and taxonomic network measures to answer biological questions.
First, we compared the superfamilies regarding their location within the network. We found that the center and barycenter of PSIMAP do not coincide, and we characterised the function of the superfamilies at the center as enzymatic activity, with an emphasis on energy metabolism and macromolecular synthesis, and at the barycenter as very general. This is also due to the superfamilies at the center being not as highly connected.
Second, we analysed PSIMAP with respect to the notion of cluster index and we related a high number of interaction partners and relatively high cluster index to potential complexes. To document this, we verified that a substantial part of the highly-connected neighbourhoods of three superfamilies belong to complex I and II. The subnetwork and the connections between the various superfamilies are especially interesting for complex I, as it is one of the largest yet least well characterised protein complexes in the cell. This new information regarding potential interactions and arrangements between the subunits might lead to novel insights into the structure and evolution of complex I and it complements approaches such as the method proposed by Bader and Hogue [61].
Third, we have shown how to characterise the evolution of interaction networks. We identified the most highly diverse superfamilies and showed that starting from the 10% most highly diverse superfamilies, progressing to 20%, 30% and 40%, the network does not fragment into different components, but progressively extends itself. This behaviour very closely reflects preferential attachment as observed in scale-free networks. Additionally, we investigated whether graph-theoretic measures can be used to predict the diversity of a superfamily, and showed that only two measures, connectivity and interaction rank, have such a correlation. A detailed scatter plot clearly shows that highly interactive superfamilies are also highly diverse and thus among the oldest. Finally, the concept of bi-connected components was used in the identification of a particular subnetwork. Our example shows two superfamilies, the subtilisin-like and the trypsin-like serine protease superfamilies, as instances of functionally convergent evolution [60], as they both share the same interaction partners. Overall, PSIMAP and its graph-theoretical analysis unravel important aspects of the evolution of protein interaction networks.
Fourth, we followed a novel approach to the fault-tolerance of interaction networks. We applied the notion of articulation vertices, whose removal disconnects the network, and of bi-connectivity, where at least two completely different paths exist between vertices, to PSIMAP. We obtained the remarkable result that there are only very few articulation vertices in PSIMAP and that 1/3 of the superfamilies in PSIMAP's main component belong to a single bi-connected component. This means that the network is very fault-tolerant, as removal of any superfamily that is not an articulation vertex does not disconnect the network. This verifies that PSIMAP is a very robust network.
The analyses we carried out for PSIMAP are general in nature and can be applied to other experimental interaction data such as BIND or DIP. Combination and further analysis of the network components of these two types of protein interaction data will lead to critical understanding of the interactome.
Overall, our graph-theoretic analysis of PSIMAP allowed us to answer a number of biological questions. In particular, the analysis sheds light on the evolution of the network, uncovers the core of the network, and identifies complexes and the most important superfamilies in terms of the network's structure.

Authors' contributions
DB generated the PSIMAP and taxonomic data used, evaluated the PSIMAP parameters, and analysed the fault-tolerance example; PD implemented interaction rank through Eigenvector analysis; RH analysed the complex I and II data; JP developed the original PSIMAP, analysed the connectivity example and suggested fundamental questions; MS developed PSIEYE, the tool used for the analysis, conceived interaction rank and the other graph-theoretic measures, and generated the examples using PSIEYE.