Visualisation and graph-theoretic analysis of a large-scale protein structural interactome
- Dan Bolser†2,
- Panos Dafas†1,
- Richard Harrington†2,
- Jong Park†2, 3 and
- Michael Schroeder†1
© Bolser et al; licensee BioMed Central Ltd. 2003
Received: 31 July 2003
Accepted: 08 October 2003
Published: 08 October 2003
Large-scale protein interaction maps provide a new, global perspective with which to analyse protein function. PSIMAP, the Protein Structural Interactome Map, is a database of all the structurally observed interactions between superfamilies of protein domains with known three-dimensional structure in the PDB. PSIMAP incorporates both functional and evolutionary information into a single network.
We present a global analysis of PSIMAP using several distinct network measures relating to centrality, interactivity, fault-tolerance, and taxonomic diversity. We found the following results: Centrality: we show that the center and barycenter of PSIMAP do not coincide, and that the superfamilies forming the barycenter relate to very general functions, while those constituting the center relate to enzymatic activity. Interactivity: we identify the P-loop and immunoglobulin superfamilies as the most highly interactive. We successfully use connectivity and cluster index, which characterise the connectivity of a superfamily's neighbourhood, to discover superfamilies of complex I and II. This is particularly significant as the structure of complex I is not yet solved. Taxonomic diversity: we found that highly interactive superfamilies are in general taxonomically very diverse and are thus amongst the oldest. Fault-tolerance: we found that the network is very robust as for the majority of superfamilies removal from the network will not break up the network.
Overall, we can single out the P-loop containing nucleotide triphosphate hydrolases superfamily as it is the most highly connected and has the highest taxonomic diversity. In addition, this superfamily has the highest interaction rank, is the barycenter of the network (it has the shortest average path to every other superfamily in the network), and is an articulation vertex, whose removal will disconnect the network. More generally, we conclude that the graph-theoretic and taxonomic analysis of PSIMAP is an important step towards the understanding of protein function and could be an important tool for tracing the evolution of life at the molecular level.
Keywords: Structural Interactome, Protein Interaction, Interactomics, Graph-theory, Interaction Rank, Taxonomic Diversity, PSIEYE, PSIMAP.
Large-scale protein interaction maps [1–9] have increased our understanding of protein function, extending 'functional context' to the network of interactions which span the proteome [10–13]. Functional genomics has fuelled this new perspective and has directed research towards computational methods of reconstructing genome-scale interaction maps.
One group of computational methods uses the abundant genomic sequence data, and is based on the assumption that genomic proximity and gene fusion result from a selective pressure to genetically link proteins which physically interact [14–16]. With the exception of conserved operons and gene fusion, however, genomic proximity is more generally indicative of indirect functional associations between proteins than of direct interactions between the gene products.
A second group of methods, based on the assumption that protein-protein interactions are conserved across species, was originally applied to genomic comparisons. Just as common function can be inferred between homologous proteins, 'homologous interaction' can be used to infer interaction between homologues of interacting proteins. This method has been validated in a comparison between PSIMAP, which contains observed protein domain interactions in the Protein Data Bank (PDB), and experimentally determined domain interactions in yeast. The method has also been systematically validated at the sequence level using BLAST, and has been improved by the use of a statistical domain level representation of the known protein interactions [22, 23].
PSIMAP, the Protein Structural Interactome Map, is a database of all the structurally observed interactions between protein domains of known three-dimensional structure in the PDB. It can be constructed using any reliable protein domain definition, where domains are defined as evolutionarily conserved structural and functional protein units. Here we use the domain definitions provided by SCOP (Structural Classification of Proteins), which uses structural and functional homology to manually define evolutionarily distinct protein domain families and superfamilies. Alternatively, other domain definitions (such as CATH, FSSP, Pfam, etc.) can be used.
By viewing interaction between superfamilies, which encompass extremely distant evolutionary relationships, PSIMAP represents domain interaction within a broad evolutionary context. The analysis of PSIMAP's network topology presented here necessarily incorporates this evolutionary perspective.
Breakdown of superfamily-superfamily interactions according to inter- and intra-interactions for homomers and heteromers. Intra-interactions are in the minority.
To check for potential sampling error in the PDB, we checked if the absolute number of domains in a superfamily is correlated to the number of observed interactions that that superfamily makes. We did not find significant evidence for this correlation: omitting four outliers, the correlation coefficient between the number of interactions and number of domains in a superfamily is only 0.16. This suggests that a superfamily's interactivity is independent of its occurrence in the PDB.
Visualizing structurally observed protein domain interaction at the superfamily level gives a very robust network which combines a broad evolutionary perspective of protein interaction with the conserved structural and functional features of protein domains. PSIMAP, therefore, represents a rich, stable overview of the protein interactome.
Which superfamilies can directly or indirectly interact with each other forming subnetworks (does PSIMAP contain evolutionarily distinct interaction networks)?
Which superfamilies can disrupt a pathway in the network if removed (highlighting critical pathways or distinct functional contexts for superfamilies in PSIMAP)?
Are there multiple indirect interactions between superfamilies (making the overall superfamily interaction network topology robust, highlighting sets of superfamilies with common functional roles)?
How many interaction partners has a superfamily acquired over the course of evolution?
How well connected is the neighbourhood of a superfamily (how is a superfamily related to the rest of the network)?
How central is a superfamily in the network (which superfamilies make the most fundamental contribution to the overall network)?
Is there a core in the network (has the network grown from a core, critical set of interactions)?
How is the network distributed within and between taxonomic groups with respect to the above measures (how diverse is the interaction network in nature)?
The last question is particularly relevant for PSIMAP, as it is based on a reliable definition of homology applied to all the available multi-domain protein structures in the PDB. PSIMAP forms a global interaction network across many species, which can be extended using sequence-based homology searches. We applied graph-theoretic and taxonomic measures to PSIMAP using the PSIEYE tool to answer the above questions.
The PSIMAP algorithm generates a large network consisting of 937 superfamilies and 538 interactions, with 512 distinct components ranging in size from the single largest component of 320 superfamilies, which is the basis for further analysis, to some 400 isolated, non-interacting single superfamilies, distributed according to a power-law. These and all subsequent analyses are based on PSIMAP produced from SCOP version 1.59. To analyse PSIMAP we followed two strands: first, we looked at the network topology of the map in terms of location and interactivity; second, we characterized the taxonomic diversity of the superfamilies. The former analysis can be broken down into two distinct aspects: location and interactivity.
Previously, it has been shown that central proteins in an interaction network are often functionally critical and their removal correlates to lethality. Wuchty and Stadler define three types of centrality and apply them to metabolic and protein interaction networks.
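The decomposition into components mentioned above can be illustrated with a breadth-first search. The following is a minimal sketch, not the PSIEYE implementation; the function name `connected_components` and the toy vertex labels are assumptions made for illustration only.

```python
from collections import deque


def connected_components(vertices, edges):
    """Return the components of an undirected graph as sets, largest first."""
    # Build an adjacency map; isolated vertices get an empty neighbour set.
    adjacency = {v: set() for v in vertices}
    for a, b in edges:
        if a == b:
            continue  # a self-interaction does not connect distinct vertices
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    seen = set()
    components = []
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first search collects everything reachable from `start`.
        seen.add(start)
        component = {start}
        queue = deque([start])
        while queue:
            v = queue.popleft()
            for w in adjacency[v]:
                if w not in seen:
                    seen.add(w)
                    component.add(w)
                    queue.append(w)
        components.append(component)

    components.sort(key=len, reverse=True)
    return components


# Hypothetical toy network: one chain, one pair, one isolated vertex.
toy_vertices = ["A", "B", "C", "D", "E", "F"]
toy_edges = [("A", "B"), ("B", "C"), ("D", "E")]
toy_components = connected_components(toy_vertices, toy_edges)
print([len(c) for c in toy_components])
```

Applied to the full superfamily interaction list, the first element of the returned list would correspond to the largest component that the subsequent analysis is based on.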
The center of the network is in the same neighbourhood as the barycenter with two superfamilies (NTN hydrolases, d.153.1 and nucleotidylyl transferase, c.26.1) in common within the six highest of both centrality measures. There are six centers in the PSIMAP network with equally small eccentricity. They are PK beta-barrel domain-like (b.58.1), Nucleotidylyl transferase (c.26.1), Ntn hydrolases (d.153.1), FMN-linked oxidoreductases (c.1.4), HPr-like (d.94.1) and Adenine nucleotide alpha hydrolases (c.26.2). As with the superfamilies which rank highly in the barycenter measurement, these superfamilies are involved in highly critical cellular functions including glycolysis; galactose / fructose metabolism and nucleotide, amino acid, lipopolysaccharide, NAD and ATP synthesis. In comparison to the barycenter superfamilies, members of the center are related to more specific enzymatic activities with an emphasis on energy metabolism and macromolecular synthesis. Conversely, members of the barycenter mediate their function via structural interactions, involving molecular switching, signalling, transport, DNA binding and protein-protein interaction. Additionally, the majority of the observed enzymatic functions in the barycenter can be attributed to the ubiquitous P-loop domain.
The slight shift in topology between the center and the barycenter in PSIMAP reflects a slight shift in the functional characteristics of the overlapping subgroups of superfamilies in the topological region. Intuitively, we hypothesize that those critical superfamilies which have general functions or a predominantly structural mode of action will have a greater number of interaction partners (which is a requirement for the highest sum of distance in the barycenter). More specific but nonetheless critical enzymatic functions, on the other hand, will be associated with many different pathways, but may mediate indirect functional roles via common metabolites and thus need not make direct physical interactions with many different members of the network.
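To make the distinction between center and barycenter concrete, the sketch below computes both for a small connected graph. It assumes the usual graph-theoretic definitions, which match the description in the text: the center is the set of vertices with the smallest eccentricity (largest distance to any other vertex), and the barycenter is the set of vertices with the smallest total distance, and hence the shortest average path, to all other vertices. Function names and the toy graph are hypothetical.

```python
from collections import deque


def bfs_distances(adjacency, source):
    """Shortest path lengths (in edges) from `source` to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adjacency[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist


def center_and_barycenter(adjacency):
    """Return (center, barycenter) of a connected undirected graph."""
    eccentricity = {}
    distance_sum = {}
    for v in adjacency:
        dist = bfs_distances(adjacency, v)
        eccentricity[v] = max(dist.values())
        distance_sum[v] = sum(dist.values())
    min_ecc = min(eccentricity.values())
    min_sum = min(distance_sum.values())
    center = {v for v, e in eccentricity.items() if e == min_ecc}
    barycenter = {v for v, s in distance_sum.items() if s == min_sum}
    return center, barycenter


def build_adjacency(edges):
    """Adjacency map of an undirected graph given as a list of vertex pairs."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


# Hypothetical toy graph: a path a-b-c-d-e with three extra partners on b.
toy = build_adjacency([
    ("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"),
    ("b", "x"), ("b", "y"), ("b", "z"),
])
center, barycenter = center_and_barycenter(toy)
print(center, barycenter)  # {'c'} {'b'}
```

In this toy graph the vertex with many partners (b) is the barycenter, while the vertex in the middle of the longest path (c) is the center, which mirrors the kind of shift between the two measures discussed above.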
The 19 most interactive superfamilies.
P-loop containing nucleotide triphosphate hydrolases
Winged helix DNA-binding domain
Nucleic acid-binding proteins
Glutathione synthetase ATP-binding domain-like
NAD(P)-binding Rossmann-fold domains
Actin-like ATPase domain
Protein kinase-like (PK-like)
Metalloproteases (zincins) catalytic domain
The majority of the most highly connected superfamilies contain families of functionally important enzymes, with only three main exceptions. They are: 1) domains from the Immunoglobulin superfamily (b.1.1), frequently found as domain linkers in genomic sequences and structures, having diverse structural roles and interacting with many different proteins; 2) domains from the EF hand all-alpha superfamily (a.39.1), a structural motif (with an average size of around 40 amino acids) involved in calcium binding and the diverse regulatory functions associated with calcium; 3) the winged helix DNA-binding domain (a.4.5), which has an extremely diverse set of functions related to DNA binding. For example, the winged helix domain associates with many different small molecule binding domains to form functionally diverse families of transcription factors in prokaryotes and eukaryotes .
The most highly interactive superfamilies take part in a wide range of critical cellular reactions, mostly relating to energy metabolism and catabolism as well as signalling and structural roles. For example, the iron-sulfur proteins (d.58.1 and d.15.4) transfer electrons in a wide variety of metabolic reactions, indicating a very early origin in protein evolution. The PK-like superfamily (d.144.1) encompasses enzymes that belong to a very extensive family of proteins involved in almost all aspects of eukaryotic signal transduction pathways, including regulation of the cell cycle, differentiation, homeostasis and the immune response. Members of this superfamily share a conserved catalytic core common with both serine/threonine and tyrosine protein kinases, and have related but uncharacterised counterparts in archaea as well as functional homologues in viruses. Ubiquitin-like superfamily domains are found in an extremely broad range of protein families, having structural roles in proteolysis (including the unfolded protein response pathway), and linking cytoskeleton proteins to proteins in the plasma membrane, as well as having roles in signal transduction, including Raf-like and Ras-binding activity, guanine nucleotide exchange activity and GTP-activated phosphatidylinositol 3-kinase activity as part of the phosphatidylinositol 3-kinase complex. This superfamily is also involved in DNA repair mechanisms, chromosome segregation, viral infection, splicing, autophagy and the regulation of membrane physical properties and cell development.
Indirect connectivity is useful in that it links the measures of connectivity to the location measures discussed above. Due to the particular structure of the network, the most highly interactive and central superfamilies also have higher indirect connectivity than peripheral superfamilies. Indirect connectivity shifts our perspective from a purely local view of interactivity towards a more global view. However, connectivity and indirect connectivity do not quantify interaction density, a measure to describe the extent to which a superfamily's interaction partners interact with each other.
Cluster index is a measure of interaction density, and is defined as the number of interactions between a vertex's neighbours divided by the total number of possible interactions between them. A cluster index of 0 means that none of a vertex's neighbours interact, whereas a cluster index of 1 indicates that they all interact with each other.
The interaction partners of the three superfamilies noted above, d.168.1, a.1.2 and c.3.1, overlap considerably to form a well-connected subnetwork. Analysis of the members of this subnetwork reveals that they correlate closely to various members of the mitochondrial respiratory chain. In particular, they match subunits of complex I and complex II, indicating that perhaps this subnetwork is representative of the two complexes and the interactions between their subunits. This could be highly significant as, whilst the structure of complex II has been solved, the structure of complex I has not yet been elucidated.
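The definition above translates directly into a few lines of code. This is a minimal sketch; the function name `cluster_index` and the toy graph are assumptions, and returning 0 for a vertex with fewer than two neighbours is a convention chosen here because the definition leaves that case open.

```python
from itertools import combinations


def cluster_index(adjacency, v):
    """Fraction of possible interactions among v's neighbours that are present."""
    neighbours = adjacency[v] - {v}  # a self-interaction is not a neighbour
    k = len(neighbours)
    if k < 2:
        return 0.0  # assumed convention: no pair of neighbours exists
    # Count each unordered pair of neighbours that interact with each other.
    links = sum(1 for a, b in combinations(neighbours, 2) if b in adjacency[a])
    possible = k * (k - 1) / 2
    return links / possible


# Hypothetical toy graph: v has partners p, q and r; only p and q interact.
toy = {
    "v": {"p", "q", "r"},
    "p": {"v", "q"},
    "q": {"v", "p"},
    "r": {"v"},
}
print(cluster_index(toy, "v"))  # one of three possible pairs
print(cluster_index(toy, "p"))  # both partners of p interact with each other
```

The second call illustrates why the measure favours vertices with few partners: a vertex with only two neighbours reaches a cluster index of 1 as soon as those two interact.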
The respiratory chain involves a series of membrane-bound proteins that use a series of electron transfer steps to create a proton gradient across the mitochondrial membrane. This proton gradient is then used as the driving force for ATP synthesis. Complex I is the first protein complex in the respiratory chain. It is part of a redox reaction, catalysing the oxidation of NADH from the citric acid cycle along with the reduction of ubiquinone. The oxidation of NADH is coupled to electron transfer via a Flavin MonoNucleotide (FMN) prosthetic group, which acts as a first acceptor of electrons from NADH. Electron transfer is carried on through complex I by several iron-sulphur (FeS) clusters in the protein. Complex II (succinate ubiquinone oxidoreductase) is the second protein complex in the chain. It is involved in a different redox reaction, catalysing the oxidation of succinate (also a product of the citric acid cycle) to fumarate, along with the reduction of ubiquinone to ubiquinol. Succinate is oxidised by using the bound FAD on the 70 kDa subunit as an electron acceptor. As in complex I, several FeS clusters, which are found in the 27 kDa subunit, help in electron transfer through the protein.
To answer whether complex I and complex II relate to the above networks, we mapped known complex I and complex II protein subunits to their SCOP superfamilies via PSI-Blast . Assigned complex I superfamilies account for the majority of the superfamilies in the smaller network (Figure 10), which shows alpha-helical ferredoxin (a.1.2) and its neighbours, and on the left half of the larger subnetwork (Figure 11), which shows the FAD/NAD(P)-binding domain and its neighbours. Complex II superfamilies account for at least 5 of the other superfamily nodes in the subnetwork in Figure 11. To be more precise, the proteins P15960, P34943, P42028, P15690, which are known subunits of complex I map to SCOP superfamilies d.15.4, c.4.1, d.58.1, and a.1.2, respectively, all of which are members of the smaller subnetwork. Furthermore, the other 2 superfamilies of this network, FMN linked oxidoreductase (c.1.4) and FAD/NAD (P) binding domain (c.3.1), are functionally significant to complex I. Additionally, we found that proteins Q09545 and Q09508 of complex II map to d.15.4, a.1.2, a.7.3, d.168.1, c.3.1. As with the prior example, other superfamily members of this network, multiheme cytochromes (a.138.1), FAD/NAD-linked reductases, dimerisation (C-terminal) domain (d.87.1), and thioredoxin-like (c.47.1), are all functionally related to the action of complex II.
The above findings show that 9 out of 11 neighbours of the FAD/NAD(P)-binding domain belong to or are related to either complex I or complex II, or both. A subnetwork has been identified around this highly connected superfamily that has a comparatively high cluster index. This stresses both the importance of the superfamily and also the importance of connectivity and cluster index as measures that are especially useful in uncovering complexes.
Both connectivity and cluster index have shortcomings: Connectivity does not consider interactions in a vertex's neighbourhood; cluster index favours low connectivity vertices. To get a better measure for the wider neighbourhood of a vertex, we have developed the idea of interaction rank, which treats interaction networks as Markov processes. In this analysis, each edge in the network is equated with a state transition in a Markov process. A similar approach has been used for the analysis of clusters in a network. For example, a superfamily with a certain number of interaction partners, $p$, corresponds to a state, $v$, in the Markov process with $p$ possible successor states $w \in N(v)$ (where $N(v)$ is the set of $v$'s neighbours). A priori, each of the transitions is chosen with the same likelihood, giving a $1/|N(v)|$ chance for $v$ to 'interact' with $w \in N(v)$, where $|N(v)|$ is the size of the set. If we enumerate all vertices from $v_1$ to $v_n$, we can capture this Markov process as a transition matrix $M = (m_{ij})$, where for all $1 \le i, j \le n$, $m_{ij} = 1/|N(v_i)|$ if $v_i$ is connected to $v_j$ and $m_{ij} = 0$ otherwise. If we compute the steady state transition probabilities of this process, we can rate vertices according to this notion of 'interactivity'. We call this rating 'interaction rank'. Essentially, the more interaction partners a superfamily has, the better its interaction rank. Also, the better connected a superfamily's neighbourhood, the better the interaction rank. These two trends are intuitively a consequence of the increased probability of indirectly returning back to a superfamily via the interconnections between its interaction partners. In this way, interaction rank combines aspects of connectivity and cluster index. It does so at a global scale incorporating information about the topology of the whole network.
In this respect, interaction rank can point to the hubs of a network in terms of its overall structure, and can overcome some of the shortcomings of connectivity and cluster index.
To compute the steady state of the transition matrix, $M$, we need to find a configuration, $x$, such that $M x = \lambda x$ for a maximal real number $\lambda$. In other words, we have to compute an eigenvector $x$ for $M$ for the maximal eigenvalue $\lambda$. There are standard libraries to do this, but since we require just the eigenvector for the largest eigenvalue, we used the power method, i.e. for a random initial configuration $x_0$ we iteratively compute $M^n x_0$ for increasing $n > 0$ until $M^n x_0$ converges. The elements of the resulting eigenvector represent the steady state probabilities of the Markov process $M$ and constitute the interaction rank of the corresponding superfamilies.
To summarise, we define a transition matrix M reflecting possible interactions between superfamilies. From the transition matrix we can compute the interaction rank of each superfamily and hence complement the measures of connectivity and cluster index. In contrast to connectivity, which considers only the direct neighbourhood of a superfamily, interaction rank takes the whole network topology into account. In contrast to cluster index, which favours vertices with few interaction partners, interaction rank increases with the number of interaction partners. Furthermore, interaction rank is capable of including additional probabilistic experimental information regarding likelihood of interaction by simply updating the transition matrix accordingly. This will be a powerful basis to customize interaction rank for a researcher's specific experiments and settings.
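The power method described above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the PSIEYE code: the function name `interaction_rank`, the tolerance and the toy graph are hypothetical; a uniform rather than random start vector is used so that the result is reproducible; and the probability held by each vertex is redistributed along its outgoing transitions, which corresponds to applying the transpose of the row-normalised matrix M. The iteration is only expected to converge if the component is connected and not bipartite, and every vertex must have at least one partner.

```python
def interaction_rank(adjacency, tolerance=1e-12, max_iterations=10000):
    """Steady-state probabilities of a random walk on an undirected graph.

    `adjacency` maps each vertex to the set of its interaction partners.
    Assumes a connected, non-bipartite graph without isolated vertices.
    """
    vertices = sorted(adjacency)
    n = len(vertices)
    x = {v: 1.0 / n for v in vertices}  # uniform start (assumption, see lead-in)
    for _ in range(max_iterations):
        nxt = {v: 0.0 for v in vertices}
        for v in vertices:
            # m_vw = 1/|N(v)|: v passes an equal share to each partner w.
            share = x[v] / len(adjacency[v])
            for w in adjacency[v]:
                nxt[w] += share
        change = max(abs(nxt[v] - x[v]) for v in vertices)
        x = nxt
        if change < tolerance:
            break
    return x


# Hypothetical toy graph: a triangle a-b-c with one extra partner d on a.
toy = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
rank = interaction_rank(toy)
for v in sorted(rank, key=rank.get, reverse=True):
    print(v, round(rank[v], 4))
```

On this unweighted toy graph the values come out proportional to each vertex's number of partners (3/8, 2/8, 2/8, 1/8), which is the standard behaviour of a random walk with uniform transition probabilities on a connected undirected graph. Replacing the uniform `share` with experimentally derived transition probabilities, as suggested in the text, would change the ranking.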
All the above measures rate vertices according to the structure of the network. Here we introduce a measure that rates superfamilies according to their taxonomic diversity. Taxonomic diversity is related to age – the more diverse a superfamily, the older it is. We have addressed the question of whether a superfamily's taxonomic diversity, and thus its age can be related to its interactivity or location in the network. This would effectively enable us to predict age from the network structure.
The 18 most highly diverse and hence oldest superfamilies. The species column indicates the number of species this superfamily occurs in, the superkingdom column indicates whether the superfamily occurs in eukaryota (E), archaea (A), bacteria (B), and viruses (V), connectivity refers to the number of interaction partners.
P-loop containing nucleotide triphosphate hydrolases
NAD(P)-binding Rossmann-fold domains
Actin-like ATPase domain
Protein kinase-like (PK-like)
Winged helix DNA-binding domain
Glutathione synthetase ATP-binding domain-like
Metalloproteases (zincins) catalytic domain
Fault-tolerance, Attacks, and Convergent Evolution
It has been argued that many protein interaction networks are scale-free networks [44, 57]. The scale-free property means that the vertex connectivity follows a power-law, i.e. there are few nodes that are highly connected, and many with low connectivity. This is also the case for PSIMAP. There are over 400 superfamilies that have no interaction partners and the connectivity of the most highly connected superfamilies quickly tails off as discussed above (P-loop (46), Immunoglobulin (38), (Trans)glycosidases (14), 4Fe-4S ferredoxins (12), Cytochrome c (11),...). Formally, the graph of number of interaction partners (y-axis) and superfamilies (x-axis) has a trend line of $y = 58.014\,x^{-0.7152}$, which fits very well with the squared correlation coefficient $R^2 = 0.9353$ (data not shown). This confirms the power law property for PSIMAP's largest component.
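A trend line of the form $y = a\,x^{b}$ can be obtained by ordinary least squares on log-transformed values. The text does not state how the reported fit was computed, so the sketch below is only one plausible procedure; the function name `fit_power_law` is hypothetical, and the example data are synthetic points generated from the reported trend line, not the PSIMAP data.

```python
import math


def fit_power_law(xs, ys):
    """Fit y = a * x**b by linear regression of log(y) on log(x).

    Returns (a, b, r_squared), where r_squared refers to the log-log fit.
    All x and y values must be strictly positive.
    """
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((lx - mean_x) ** 2 for lx in log_x)
    syy = sum((ly - mean_y) ** 2 for ly in log_y)
    sxy = sum((lx - mean_x) * (ly - mean_y) for lx, ly in zip(log_x, log_y))
    b = sxy / sxx                        # slope = exponent of the power law
    a = math.exp(mean_y - b * mean_x)    # intercept = log of the prefactor
    r_squared = (sxy * sxy) / (sxx * syy)
    return a, b, r_squared


# Synthetic points lying exactly on the reported trend line (not real data).
ranks = list(range(1, 21))
partners = [58.014 * r ** -0.7152 for r in ranks]
a, b, r_squared = fit_power_law(ranks, partners)
print(round(a, 3), round(b, 4), round(r_squared, 4))
```

Because the synthetic points lie exactly on the curve, the fit recovers the prefactor and exponent with a squared correlation of 1; real connectivity data would scatter around the line and give a lower value.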
Scale-free networks such as PSIMAP have special properties. On the one hand, they are very fault-tolerant, in that the removal of a random vertex is not likely to disconnect a component. They are, however, prone to attacks, in that the removal of the most highly connected vertices severely affects the network. One small component of the PSIMAP interaction network consists of a methionine synthase domain (a.46.1) interacting with a methionine synthase activation domain (d.173.1) interacting with a cobalamin (vitamin B12)-binding domain (c.23.6) interacting with cobalamin (vitamin B12)-dependent enzymes (c.1.19), which in turn interacts with both a diol dehydratase gamma subunit (a.23.2) and a diol dehydratase beta subunit (c.51.3), as shown in figure 16. In this subnetwork, the cobalamin (vitamin B12)-binding domain and the cobalamin (vitamin B12)-dependent enzyme have a common role, as their removal disconnects the component, i.e. without superfamilies c.1.19 and c.23.6, methionine synthase domains a.46.1 and d.173.1 cannot interact with the diol dehydratase subunits a.23.2 and c.51.3. In graph-theory, such vertices are called 'articulation vertices', and, by definition, their removal disconnects the network.
Methionine synthase: PDB entries 1k7y and 1k98 link SCOP superfamilies a.46.1, d.173.1 and c.23.6.
Methylmalonyl-CoA mutase: PDB entries 1cb7, 1ccw, 1e1c, 1i9c and 1-7req link SCOP superfamilies c.23.6 and c.1.19.
Glycerol dehydratase: PDB entries 1dio, 1eex, 1egm and 1egv link SCOP superfamilies c.1.19, a.23.2, and c.51.3.
Proteins in set one (methionine synthase) are linked to proteins in set two (methylmalonyl-CoA mutase) via the common superfamily, c.23.6 (Cobalamin binding domain). While the link between these two sets of proteins does not represent a direct physical interaction, it highlights the evolutionary connection between the two proteins (c.23.6 physically interacts with both d.173.1 and c.1.19). The link also highlights the functional coupling of the two proteins mediated by the common cofactor, showing they are involved in related metabolic pathways and diseases.
Methionine synthase and methylmalonyl-CoA mutase have well described functions in higher organisms, while proteins in the third set, (glycerol dehydratase) are described as bacterial. The link between the methylmalonyl-CoA mutase and glycerol dehydratase is made via the common superfamily c.1.19 (cobalamin dependent enzymes). In this case we suspect that c.1.19 facilitates a true physiological interaction pathway between the superfamilies present in both species.
As argued above, the pathway in Figure 16 is very dependent on both the cobalamin (vitamin B12)-binding domain and the cobalamin (vitamin B12)-dependent enzymes being present in the network, as these two vertices are so-called articulation vertices, whose removal disconnects the component. If for this reason certain superfamilies are particularly important, then is there any evidence for back-up mechanisms? One way to ensure that connectivity is maintained despite the removal of vertices in the network is to have multiple and entirely different paths connecting two superfamilies. Then the interruption of one path does not interrupt the network as a whole. In graph-theory a sub-graph in which all pairs of vertices are connected by at least two entirely different paths is called a bi-connected component. Any vertex in a bi-connected component can be removed without breaking the network into separate components.
The right panel of Figure 17 shows a smaller bi-connected component consisting of the four superfamilies trypsin-like serine proteases (b.47.1), CI-2 family of serine protease inhibitors (d.40.1), protease propeptides/inhibitors (d.58.3), and subtilisin-like (c.41.1). Similarly to the P-loop above, the removal of the subtilisin-like superfamily will disconnect the four superfamilies from the rest of the network reachable through the subtilisin-inhibitor (d.84.1). The bi-connected component shows that both the subtilisin-like and trypsin-like serine protease superfamilies can bind both the CI-2 serine protease inhibitors and the protease propeptides/inhibitors. This indicates that both protease superfamilies and both inhibitor superfamilies have a similar function, highlighting two instances of functionally convergent evolution. Thus, the graph-theoretic analysis of bi-connected components uncovers an instance of convergent evolution.
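Articulation vertices can be found with a single depth-first search that tracks, for each vertex, the earliest-discovered vertex reachable from its subtree. The sketch below uses this standard low-point technique; it is not claimed to be the method implemented in PSIEYE, and the function names are hypothetical. The edge list encodes only the interactions of the cobalamin component as stated in the text.

```python
def articulation_vertices(adjacency):
    """Return the set of vertices whose removal disconnects their component."""
    discovery = {}  # order in which each vertex is first visited
    low = {}        # earliest discovery index reachable from the subtree
    result = set()
    counter = [0]

    def visit(v, parent):
        discovery[v] = low[v] = counter[0]
        counter[0] += 1
        children = 0
        for w in adjacency[v]:
            if w == parent:
                continue
            if w in discovery:
                # Back edge: v can reach an earlier vertex without its parent.
                low[v] = min(low[v], discovery[w])
            else:
                children += 1
                visit(w, v)
                low[v] = min(low[v], low[w])
                # w's subtree cannot bypass v, so removing v cuts it off.
                if parent is not None and low[w] >= discovery[v]:
                    result.add(v)
        # The root is an articulation vertex only if it has several subtrees.
        if parent is None and children > 1:
            result.add(v)

    for v in adjacency:
        if v not in discovery:
            visit(v, None)
    return result


def build_adjacency(edges):
    """Adjacency map of an undirected graph given as a list of vertex pairs."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


# Interactions of the cobalamin component as described in the text.
cobalamin = build_adjacency([
    ("a.46.1", "d.173.1"),
    ("d.173.1", "c.23.6"),
    ("c.23.6", "c.1.19"),
    ("c.1.19", "a.23.2"),
    ("c.1.19", "c.51.3"),
])
print(sorted(articulation_vertices(cobalamin)))
```

On this edge list the function reports c.23.6 and c.1.19, the two superfamilies highlighted in the text, and additionally d.173.1, because removing it isolates a.46.1. The recursive formulation is adequate for small components; very deep graphs would need an iterative variant to stay within Python's recursion limit.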
PSIMAP, a map of protein interactions at superfamily level, is computed using data from PDB and SCOP, and therefore provides a structural, robust, coarse-grained view of the interactome. In this paper, we have evaluated and justified PSIMAP, we have described the development of PSIEYE, a tool for large-scale interaction network analysis and visualization, and we have used PSIEYE to analyze PSIMAP and investigate several biologically significant questions.
We have evaluated and justified PSIMAP: First, we justified a threshold of 5 amino acid contacts at less than 5 Angstrom by considering interactions over the whole parameter space. Second, we justified interaction of covalently linked domains due to the use of SCOP. Third, we justified the approach of interaction at superfamily level by showing that superfamily size and number of interaction partners are not correlated.
We have developed PSIEYE, a tool for large-scale interaction network analysis and visualization: We have implemented a host of graph-theoretic measures such as connectivity, cluster index, eccentricity, sum of distance, and bi-connectivity to characterise proteins and their interactions in the maps. We complemented these measures with our novel approach of using interaction rank, which views interactions as a Markov process. This allowed us to rank proteins by their interactivity, effectively combining aspects of connectivity and cluster index at a global scale. We have discussed how to compute interaction rank by computing the stable state of the Markov process. The interaction rank approach also has the advantage that it can be customized by taking additional information on the possibility and probability of specific interactions into account, thus combining the large scale structural interaction map with, e.g., experimentally determined data.
We analysed PSIMAP: We applied the graph theoretic and taxonomic network measures to answer biological questions.
First, we compared the superfamilies regarding their location within the network. We found that the center and barycenter of PSIMAP do not coincide and we characterised the function of the superfamilies at the center as enzymatic activity, with an emphasis on energy metabolism and macromolecular synthesis and at the barycenter as very general. This is also due to the superfamilies at the center being not as highly connected.
Second, we analysed PSIMAP with respect to the notion of cluster index and we related a high number of interaction partners and relatively high cluster index to potential complexes. To document this, we verified that a substantial part of the highly-connected neighbourhoods of three superfamilies belong to complex I and II. The subnetwork and the connections between the various superfamilies are especially interesting, as complex I is one of the largest yet least well characterised protein complexes in the cell. This new information regarding potential interactions and arrangements between the subunits might lead to novel insights into the structure and evolution of complex I and it complements approaches such as the method proposed by Bader and Hogue.
Third, we have shown how to characterise the evolution of interaction networks. We identified the most highly diverse superfamilies and showed that starting from the 10% most highly diverse superfamilies, progressing to 20%, 30% and 40%, the network does not fragment into different components, but progressively extends itself. This behaviour very closely reflects preferential attachment as observed in scale-free networks. Additionally, we investigated whether graph-theoretic measures can be used to predict the diversity of a superfamily, and showed that only two measures, connectivity and interaction rank, have such a correlation. A detailed scatter plot clearly shows that highly interactive superfamilies are also highly diverse and thus among the oldest. Finally, the concept of bi-connected components was used in the identification of a particular subnetwork. Our example shows two superfamilies, the subtilisin-like and the trypsin-like serine protease superfamilies, as instances of functionally convergent evolution, as they both share the same interaction partners. Overall, PSIMAP and its graph-theoretical analysis unravel important aspects of the evolution of protein interaction networks.
Fourth, we followed a novel approach to the fault-tolerance of interaction networks. We applied the notion of articulation vertices, whose removal disconnects the network, and of biconnectivity, where at least two completely different paths exist between vertices, to PSIMAP. We obtained the remarkable result that there are only very few articulation vertices in PSIMAP and that one third of the superfamilies in PSIMAP's main component belong to a single bi-connected component. This means that the network is very fault-tolerant, as removal of any superfamily that is not an articulation vertex does not disconnect the network. This verifies that PSIMAP is a very robust network.
The analyses we carried out for PSIMAP are general in nature and can be applied to other experimental interaction data such as BIND or DIP. Combination and further analysis of the network components of these two types of protein interaction data will lead to critical understanding of the interactome.
Overall, our graph-theoretic analysis of PSIMAP allowed us to answer a number of biological questions. In particular, the analysis sheds light on the evolution of the network, uncovers the core of the network, and identifies complexes as well as the most important superfamilies in terms of the network's structure.
Jong Park was partly supported by the Ministry of Information and Communication of South Korea under grant number IMT2000-C3-4. JP acknowledges the support of MRC-DUNN in the previous period of stay.
- Ito T, et al.: A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc Natl Acad Sci USA 2001, 98(8):4569–4574. 10.1073/pnas.061034498
- McCraith S, et al.: Genome-wide analysis of vaccinia virus protein-protein interactions. Proc Natl Acad Sci USA 2000, 97(9):4879–4884. 10.1073/pnas.080078197
- Uetz P, et al.: A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature 2000, 403(6770):623–7. 10.1038/35001009
- Walhout AJ, et al.: Protein Interaction Mapping in C. elegans Using Proteins Involved in Vulval Development. Science 1999, 5450: 116–121.
- Fromont-Racine M, et al.: Genome-wide protein interaction screens reveal functional networks involving Sm-like proteins. Yeast 2000, 17(2):95–110. 10.1002/1097-0061(20000630)17:2<95::AID-YEA16>3.0.CO;2-H
- Fromont-Racine M, Rain JC, Legrain P: Toward a functional analysis of the yeast genome through exhaustive two-hybrid screens. Nat Genet 1997, 16(3):277–82.
- Ito T, et al.: Toward a protein-protein interaction map of the budding yeast: A comprehensive system to examine two-hybrid interactions in all possible combinations between the yeast proteins. Proc Natl Acad Sci USA 2000, 97(3):1143–7. 10.1073/pnas.97.3.1143
- Flajolet M, et al.: A genomic approach of the hepatitis C virus generates a protein interaction map. Gene 2000, 242(1–2):369–79. 10.1016/S0378-1119(99)00511-9
- Rain JC, et al.: The protein-protein interaction map of Helicobacter pylori. Nature 2001, 409(6817):211–5. 10.1038/35051615
- Hartwell LH, et al.: From molecular to modular cell biology. Nature 1999, 6761(Supp 1):C47-C54. 10.1038/35011540
- Vidal M: A Biological Atlas of Functional Maps. Cell 2001, 104(3):333–340.
- Fellenberg M, et al.: Integrative Analysis of Protein Interaction Data. In Intelligent Systems for Molecular Biology. La Jolla, CA: AAAI Press 2000.
- Lappe M, et al.: Generating protein interaction maps from incomplete data: application to fold assignment. Bioinformatics 2001, 17(Suppl 1):S149–56.
- Marcotte EM, et al.: Detecting protein function and protein-protein interactions from genome sequences. Science 1999, 285(5428):751–3. 10.1126/science.285.5428.751
- Dandekar T, et al.: Conservation of gene order: a fingerprint of proteins that physically interact. Trends Biochem Sci 1998, 23(9):324–8. 10.1016/S0968-0004(98)01274-2
- Enright AJ, et al.: Protein interaction maps for complete genomes based on gene fusion events. Nature 1999, 402(6757):86–90. 10.1038/47056
- Huynen M, et al.: Predicting protein function by genomic context: quantitative evaluation and qualitative inferences. Genome Res 2000, 10(8):1204–10. 10.1101/gr.10.8.1204
- Pellegrini M, et al.: Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc Natl Acad Sci USA 1999, 96(8):4285–8. 10.1073/pnas.96.8.4285
- Berman HM, et al.: The Protein Data Bank. Nucleic Acids Res 2000, 28(1):235–42. 10.1093/nar/28.1.235
- Park J, Lappe M, Teichmann SA: Mapping protein family interactions: intramolecular and intermolecular protein family interaction repertoires in the PDB and yeast. J Mol Biol 2001, 307(3):929–38. 10.1006/jmbi.2001.4526
- Matthews LR, et al.: Identification of potential interaction networks using sequence-based searches for conserved protein-protein interactions or "interologs". Genome Res 2001, 11(12):2120–6. 10.1101/gr.205301
- Wojcik J, Schachter V: Protein-protein interaction map inference using interacting domain profile pairs. Bioinformatics 2001, 17(Suppl 1):S296–305.
- Deng M, et al.: Inferring domain-domain interactions from protein-protein interactions. Genome Res 2002, 12(10):1540–8. 10.1101/gr.153002
- Murzin AG, et al.: SCOP: A Structural Classification of Proteins Database for the Investigation of Sequences and Structures. Journal of Molecular Biology 1995, 247(4):536. 10.1006/jmbi.1995.0159
- Orengo CA, et al.: CATH – a hierarchic classification of protein domain structures. Structure 1997, 5: 1093–108.
- Holm L, Sander C: Mapping the protein universe. Science 1996, 273: 595–603.
- Sonnhammer EL, Eddy SR, Durbin R: Pfam: a comprehensive database of protein domain families based on seed alignments. Proteins 1997, 28: 405–420. 10.1002/(SICI)1097-0134(199707)28:3<405::AID-PROT10>3.0.CO;2-L
- Aloy P, Russell RB: Interrogating protein interaction networks through structural biology. Proc Natl Acad Sci USA 2002, 99(9):5896–5901. 10.1073/pnas.092147999
- Chothia C: Proteins. One thousand families for the molecular biologist. Nature 1992, 357(6379):543–4. 10.1038/357543a0
- Orengo CA, Jones DT, Thornton JM: Protein superfamilies and domain superfolds. Nature 1994, 372(6507):631–4. 10.1038/372631a0
- Alexandrov NN, Go N: Biological meaning, statistical significance, and classification of local spatial similarities in nonhomologous proteins. Protein Sci 1994, 3(6):866–75.
- Wang ZX: How many fold types of protein are there in nature? Proteins 1996, 26(2):186–91. 10.1002/(SICI)1097-0134(199610)26:2<186::AID-PROT8>3.3.CO;2-3
- Zhang CT: Relations of the numbers of protein sequences, families and folds. Protein Engineering 1997, 10(7):757–761. 10.1093/protein/10.7.757
- Gough J, et al.: Assignment of homology to genome sequences using a library of hidden Markov models that represent all proteins of known structure. J Mol Biol 2001, 313(4):903–19. 10.1006/jmbi.2001.5080
- Tsai CJ, et al.: Protein-protein interfaces: architectures and interactions in protein-protein interfaces and in protein cores. Their similarities and differences. Crit Rev Biochem Mol Biol 1996, 31(2):127–52.
- Bennett MJ, Choe S, Eisenberg D: Domain swapping: entangling alliances between proteins. Proc Natl Acad Sci USA 1994, 91(8):3127–31.
- Miller S: The structure of interfaces between subunits of dimeric and tetrameric proteins. Protein Eng 1989, 3(2):77–83.
- Jones S, Marin A, Thornton JM: Protein domain interfaces: characterization and comparison with oligomeric protein interfaces. Protein Eng 2000, 13(2):77–82. 10.1093/protein/13.2.77
- Bader GD, Hogue CW: BIND – a data specification for storing and describing biomolecular interactions, molecular complexes and pathways. Bioinformatics 2000, 16(5):465–77. 10.1093/bioinformatics/16.5.465
- Xenarios I, et al.: DIP: the Database of Interacting Proteins. Nucleic Acids Research 2000, 28(1):289–291. 10.1093/nar/28.1.289
- Ju BH, et al.: Visualization and analysis of protein interactions. Bioinformatics 2003, 19: 317–318. 10.1093/bioinformatics/19.2.317
- Enright AJ, Ouzounis CA: BioLayout-an automatic graph layout algorithm for similarity visualization. Bioinformatics 2001, 17(9):853–854. 10.1093/bioinformatics/17.9.853
- Mrowka R: A Java applet for visualizing protein-protein interaction. Bioinformatics 2001, 17(7):669–670. 10.1093/bioinformatics/17.7.669
- Jeong H, et al.: Lethality and centrality in protein networks. Nature 2001, 6833: 41. 10.1038/35075138
- Wuchty S, Stadler PF: Centers of complex networks. J Theor Biol 2003, 223(1):45–53. 10.1016/S0022-5193(03)00071-7
- Schwikowski B, Uetz P, Fields S: A network of protein-protein interactions in yeast. Nature Biotechnology 2000, 18(12):1257–1261. 10.1038/82360
- Schroder M, et al.: PSIEYE: A tool for the graph-theoretic analysis of protein interaction networks. Bioinformatics [submitted]
- Park J, Bolser D: Conservation of Protein Interaction Network in Evolution. Genome Informatics Series 2001, 135–140.
- Hemmingsen SM, et al.: Homologous plant and bacterial proteins chaperone oligomeric protein assembly. Nature 1988, 333(6171):330–4. 10.1038/333330a0
- Anantharaman V, Koonin EV, Aravind L: Regulatory Potential, Phyletic Distribution and Evolution of Ancient, Intracellular Small-molecule-binding Domains. Journal of Molecular Biology 2001, 307(5):1271–1292. 10.1006/jmbi.2001.4508
- Hanks SK, Hunter T: Protein kinases 6: The eukaryotic protein kinase superfamily: kinase (catalytic) domain structure and classification. Faseb Journal 1995, 9(8):576.
- Djordjevic S, Driscoll PC: Structural insight into substrate specificity and regulatory mechanisms of phosphoinositide 3-kinases. Trends in Biochemical Sciences 2002, 27(8):426–432. 10.1016/S0968-0004(02)02136-9
- Watts DJ, Strogatz SH: Collective dynamics of 'small-world' networks. Nature 1998, 393(6684):440–2. 10.1038/30918
- Altschul SF, et al.: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25(17):3389–402. 10.1093/nar/25.17.3389
- Dongen Sv: Graph Clustering by Flow Simulation. PhD thesis. University of Utrecht, Centre for Mathematics and Computer Science 2000.
- Wheeler DL, et al.: Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 2000, 28(1):10–4. 10.1093/nar/28.1.10
- Wagner A, Fell DA: The small world inside large metabolic networks. Proc R Soc Lond Biol Sci 2001, 1478: 1803–1810. 10.1098/rspb.2001.1711
- Christensen B, et al.: Homocysteine remethylation during nitrous oxide exposure of cells cultured in media containing various concentrations of folates. J Pharmacol Exp Ther 1992, 261(3):1096–105.
- Allen RH, et al.: Metabolic abnormalities in cobalamin (vitamin B12) and folate deficiency. Faseb J 1993, 7(14):1344–53.
- Doolittle RF: Convergent evolution: the need to be explicit. Trends Biochem Sci 1994, 19(1):15–8. 10.1016/0968-0004(94)90167-8
- Bader GD, Hogue CW: An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics 2003, 4(1):2. 10.1186/1471-2105-4-2
This article is published under license to BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.