Functional clustering of yeast proteins from the protein-protein interaction network
© Sen et al. 2006
Received: 23 February 2006
Accepted: 24 July 2006
Published: 24 July 2006
The abundant data available for protein interaction networks have not yet been fully understood. New types of analyses are needed to reveal organizational principles of these networks to investigate the details of functional and regulatory clusters of proteins.
In the present work, individual clusters identified by an eigenmode analysis of the connectivity matrix of the protein-protein interaction network in yeast are investigated for possible functional relationships among the members of the cluster. With our functional clustering we have successfully predicted several new protein-protein interactions that indeed have been reported recently.
Eigenmode analysis of the entire connectivity matrix yields both a global and a detailed view of the network. We have shown that the eigenmode clustering not only is guided by the number of proteins with which each protein interacts, but also leads to functional clustering that can be applied to predict new protein interactions.
Systems biology is a new frontier for bioinformatics research, aimed at understanding complex biological systems in cells by integrating interactions between large numbers of constituent components, including genes, proteins, and metabolites. Examples of systems biology research include studies of gene interaction networks [1–3], regulatory networks [1, 4–6], metabolic pathway modeling [7, 8], and combinations of these networks [3, 9–11]. By their nature, systems biology studies require highly detailed, large-scale simulations that are computationally demanding.
Proteins represent the major category of large functional biomolecules. How proteins interact with one another is a current subject of many high-throughput studies. The number of proteins in an organism can reach tens of thousands. Comprehending the functional, developmental, and regulatory networks comprising these temporal and spatial protein pairs is a formidable task [12–17] since the number of their pairwise combinations can reach millions.
Protein clustering in global interaction networks is important for revealing cellular functionality. Clustering usually involves defining one or more properties among samples and forming individual clusters based on the similarities of these properties, such as association with similar biochemical pathways (e.g. metabolic, signaling, regulatory), functional classification, cellular localization, or evolution (co-evolution, conservation, phylogeny). Usually, though, a combination of properties has been utilized [19, 20]. Clustering in part serves to detect incorrect annotations in databases (e.g. GO or KEGG), or to discover new connections in interaction networks. Clusters based on the topological information [21, 22] itself can also be useful for understanding the organizational principles of interaction networks (not only biological, but also social networks) and for identifying highly interconnected proteins with functional significance.
Our approach in this paper using connectivity matrix and subsequent eigenvalue/eigenvector decomposition is also based on the topological properties of the interaction network as a whole. Although significant proteins (in each eigenvector) form clusters, these clusters differ from those obtained by methods that are based solely on protein properties, because they reflect the organizational patterns of the protein interactions themselves based on topological considerations.
In the present study, we show that computational analyses of experimental data on protein networks can lead to discoveries of new, unexpected relationships, which emphasizes the importance of the global view of a protein interaction network.
In this paper, we have used the methodology of spectral analysis of graphs. We earlier applied a similar approach for protein dynamics analysis using elastic network models [25–29]. Spectral analysis has also been applied by Vishveshwara and co-workers to the problems of protein structure similarity, protein domain identification, and backbone clustering [30–35]. They have shown that important clusters in protein structures can be extracted from dominant eigenvalues of Kirchhoff's matrix, and that the components of the eigenvector corresponding to the second lowest eigenvalue of the Kirchhoff's matrix of a protein-structure similarity network lead to successful sub-clustering of functionally similar proteins. Biological networks were also extensively studied by Alon, Leibler and co-workers [36–50].
We used the yeast protein interaction data available in the GRID (General Repository for Interaction Data sets) database, which is a curated database of physical, genetic, and functional interactions encompassing many data sets [52–58]. The database contained 4906 proteins and 19,037 interactions at the time we started our analysis. Most recently, this database has been updated by the addition of 753 new experimental interactions determined by Krogan et al. Although the whole set of interactions is not curated and therefore includes some redundancies with previous entries in the database, it nevertheless reveals significant new interactions. This has created an excellent opportunity for us to test the predictive power of our method. Our preliminary studies based on the previous version of the database (not containing Krogan et al.'s results) show that our theoretical approach leads to correct predictions of some new protein-protein interactions provided in Krogan et al.'s data. Details of some of these successfully predicted cases will be shown here.
We have converted the pairwise interaction information obtained from the GRID data into a connectivity matrix C for subsequent analyses. The individual elements of the symmetric matrix C are as follows: an off-diagonal element is 1 if the two proteins interact and 0 if they do not, and each diagonal element of the connectivity matrix is taken as the negative sum of the other elements in its row (or column). We then applied the standard method of matrix eigenanalysis used in linear algebra. Readers who are non-experts in this field may read a brief tutorial provided in the Methods section.
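The construction of C described above can be sketched in a few lines of Python with NumPy. This is a minimal illustration, not the authors' code: the function name `build_connectivity_matrix`, the toy protein labels, and the choice to ignore self-interactions and duplicate pairs are all assumptions made for the example.

```python
import numpy as np

def build_connectivity_matrix(pairs):
    """Return (proteins, C) for a list of undirected interaction pairs.

    Off-diagonal C[i, j] is 1 if proteins i and j interact, else 0.
    Diagonal C[i, i] is the negative number of partners of protein i.
    """
    proteins = sorted({p for pair in pairs for p in pair})
    index = {name: i for i, name in enumerate(proteins)}
    n = len(proteins)
    C = np.zeros((n, n))
    for a, b in pairs:
        if a == b:
            continue  # assumption: self-interactions are ignored in this sketch
        i, j = index[a], index[b]
        C[i, j] = C[j, i] = 1.0  # repeated pairs collapse to a single contact
    # Diagonal: negative sum of the off-diagonal elements of each row.
    C[np.diag_indices(n)] = -C.sum(axis=1)
    return proteins, C

# Hypothetical toy network (labels are placeholders, not GRID entries).
pairs = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")]
proteins, C = build_connectivity_matrix(pairs)
print(proteins)
print(C)
```

With this definition every row of C sums to zero, which is why the matrix is singular.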
We should note that our definition of the diagonal elements of the connectivity matrix as the negative sums of all non-diagonal elements of the given column (or row) automatically leads to a connectivity matrix that is singular and must be analyzed through the Singular Value Decomposition (SVD) technique. The definition of diagonal elements of the matrix that implies its singularity has a deep physical meaning when the technique is applied to protein structures. For example, in the case of elastic network models of proteins [25–29], or the Gaussian model of polymer networks [60, 61], the zero eigenvalues are associated with the motion of the center of mass of the studied object.
In this work, we used SVD (since det C = 0) to extract eigenvalues and eigenvectors instead of alternative clustering methods. The SVD method has been extremely useful in the development of elastic network models [25, 26] used to compute protein motions (i.e. to divide the structure into domains, or structure clusters), in applications to microarrays, and in the analysis of other complex data. The SVD methodology has recently been used for network-level analysis of metabolic regulation in human red blood cells, detection of functional connectivities in cortical thickness, studies of transcription modules in large-scale gene expression data, and analysis of large-scale metabolomic networks [70, 71].
We have applied the SVD subroutine available in the LAPACK library to calculate all eigenvalues and eigenvectors of the connectivity matrix. This is a straightforward procedure and requires a relatively modest expenditure of computing time: all 4906 eigenvalues (and corresponding eigenvectors) were computed in 3 hours on an SGI Origin 2800 with 6 GB of RAM. We have found that all eigenvalues are negative except for 43 zero eigenvalues. This behavior, with a relatively large number of zero eigenvalues, typically implies a network with some sparsely connected nodes, as we have previously observed in our analysis of protein structures.
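The sign pattern of the spectrum can be checked on a small example. For a matrix built as described (contacts off the diagonal, negative degrees on the diagonal), it is a standard result of spectral graph theory that all eigenvalues are non-positive and that the number of zero eigenvalues equals the number of disconnected pieces of the network. The sketch below assumes NumPy's symmetric eigensolver `numpy.linalg.eigvalsh` in place of the LAPACK SVD routine used in the paper; the toy network, the tolerance, and the helper names are hypothetical.

```python
import numpy as np

def count_zero_eigenvalues(C, tol=1e-8):
    """Count eigenvalues of the symmetric matrix C with magnitude below tol."""
    eigenvalues = np.linalg.eigvalsh(C)
    return int(np.sum(np.abs(eigenvalues) < tol))

def count_components(C):
    """Count connected components by depth-first search over off-diagonal contacts."""
    n = C.shape[0]
    seen = [False] * n
    components = 0
    for start in range(n):
        if seen[start]:
            continue
        components += 1
        seen[start] = True
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if j != i and C[i, j] != 0 and not seen[j]:
                    seen[j] = True
                    stack.append(j)
    return components

# Hypothetical toy network: a triangle (0, 1, 2), a pair (3, 4), an isolated node (5).
A = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4)]:
    A[i, j] = A[j, i] = 1.0
C = A - np.diag(A.sum(axis=1))  # negative degrees on the diagonal

eigenvalues = np.linalg.eigvalsh(C)
print("largest eigenvalue:", eigenvalues.max())
print("zero eigenvalues:", count_zero_eigenvalues(C))
print("connected components:", count_components(C))
```

On this reading, the 43 zero eigenvalues reported above would correspond to 43 mutually disconnected pieces of the interaction network, most of them presumably very small.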
We should note that the same protein(s) may belong to several different clusters. This is because each cluster corresponds to an eigenvector related to a specific eigenvalue. Because the whole protein interaction network database for yeast contains 4906 proteins, there are also 4906 eigenvectors and corresponding eigenclusters. Since each cluster contains at least several proteins, every protein usually belongs to several clusters. This corresponds to the situation in the normal mode analysis of protein motions, where a given residue can be involved at once in several functionally important motions, which may lead to functional promiscuity of proteins, where the same protein can have several different functions.
Interactome datasets contain protein interaction information obtained using a wide range of experimental methods, each providing data with differing reliability due to the limitations of the method used. One approach to reconciling the reliability of the protein interaction data obtained using various experimental methods is to assign weights to the interactions based on either the confidence of the particular experimental method, or the confirmation of interactions by additional experimental methods (e.g. refs [19, 74]). However, this approach also creates additional problems: not every protein interaction can be verified with multiple experiments due to experimental limitations, and such weighting schemes might increase the unreliability of the data.
Classical wet-bench molecular biology approaches that focus on a single protein interaction are generally accurate. However, when a high-throughput method (e.g. the yeast two-hybrid assays) is used, the number of wrongly annotated interactions (i.e. false positives) increases, and sometimes, even some reported protein interactions cannot be reconciled with the known protein complexes.
The exact false positive rates and completeness of these large-scale experiments are relatively unknown because of coverage limitations: When Vidal and co-workers created random, exponential, power-law, and truncated normal topologies, they observed that these sampled maps were not characteristically different from those obtained using the yeast two-hybrid systems, suggesting that the current interaction maps may be much less complete than we previously thought. The incompleteness of the protein networks thus biases topology-based analyses.
Another limitation in protein networks is that global protein interaction networks present a rather static picture of protein interactions, neglecting transport and kinetic aspects. There are two distinct in vivo requirements for proteins to interact: first, two proteins need be in close proximity inside the cell; second, the kinetics of this interaction depends on their concentration and diffusion limitations. These limitations are coupled with other cellular processes regulating gene expression and utilizing embedded positive and negative control loops to ensure cell fitness. In the analysis of protein interaction networks, these issues are usually overlooked for practical reasons.
Despite these difficulties, computational analyses of protein interaction networks could be extremely useful, for example, if they can suggest new likely pairings that have not been yet discovered, or reveal new structural or functional linkages within clusters of proteins from the protein network.
The GRID database used for our calculations contains not only physical interactions, but also genetic and functional interactions. We should keep in mind that the available physical interactions in the database cannot always be understood in the sense that two molecules are selectively and specifically binding in vivo. For example, even the results of the yeast two-hybrid experiments that use hunt and bait plasmids may suffer from the presence of promiscuous hydrophobic patches on a protein surface, so an experimentally derived interaction may not correspond with certainty to an actual in vivo interaction. The definitions for genetic and functional interactions are even less strict: the interaction data obtained with the synthetic lethality experiments might instead indicate that two "interacting" proteins were only in closely related pathways/processes.
The rank of the average degree of connectivity for eigenvector clusters is shown in Figure 2 as a function of the eigenvector index. The rank grows almost linearly up to approximately the 2200th cluster, which is an interesting finding in itself. The remaining clusters contain proteins with small numbers of neighbors, and as a result noise disturbs the linearity of the plot. Figure 2 reveals that the singular value decomposition method clusters proteins according to their numbers of interactions (degree). In spectral theory it is known that eigenvectors will cluster nodes with similar degrees of connectivity. The most crucial discovery in our work is that nodes that have a similar degree of connectivity are highly likely to interact with each other. This observation might possibly be used in searches for missing protein-protein interactions.
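The relation between eigenvector index and degree can be illustrated on a toy network. The sketch below is only an illustration under stated assumptions: the criterion for a "significant" protein is not given in this section, so the code takes the `top_k` components of largest absolute value as a stand-in, and the network (a hub with five partners plus a separate interacting pair) is hypothetical. `numpy.linalg.eigh` is used instead of the LAPACK SVD routine of the paper.

```python
import numpy as np

def significant_nodes(eigenvector, top_k):
    """Indices of the top_k components with the largest absolute value (assumed criterion)."""
    order = np.argsort(-np.abs(eigenvector))
    return order[:top_k]

def mean_degree_per_cluster(C, top_k=3):
    """For each eigenvector, return (eigenvalue, mean degree of its significant nodes)."""
    degrees = -np.diag(C)  # the diagonal stores negative degrees
    # eigh returns eigenvalues in ascending order; column k is the k-th eigenvector.
    eigenvalues, eigenvectors = np.linalg.eigh(C)
    result = []
    for k in range(len(eigenvalues)):
        nodes = significant_nodes(eigenvectors[:, k], top_k)
        result.append((eigenvalues[k], degrees[nodes].mean()))
    return result

# Hypothetical toy network: node 0 is a hub with partners 1..5; nodes 6 and 7 form a pair.
A = np.zeros((8, 8))
for leaf in range(1, 6):
    A[0, leaf] = A[leaf, 0] = 1.0
A[6, 7] = A[7, 6] = 1.0
C = A - np.diag(A.sum(axis=1))

clusters = mean_degree_per_cluster(C, top_k=1)
for eigenvalue, mean_degree in clusters[:2]:
    print(f"eigenvalue {eigenvalue:.2f}: mean degree of significant nodes {mean_degree:.1f}")
```

In this example the eigenvector with the most negative eigenvalue is dominated by the hub, and the next one by the low-degree pair, which mirrors the ordering by degree described above.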
The clusters revealed by the eigenanalysis show order and contain functional information. This is an important observation motivating further, more detailed studies. The distributions of eigenvalues of the original and the reduced matrices, in contrast to the shuffled matrices, are quite similar. This indicates that despite possible experimental errors and many undiscovered interactions in the GRID database, the overall shape of the eigenvalue distribution and the resulting interaction clusters are conserved. This conservation can be exploited for predictive purposes.
The 5 smallest eigenvalues and the proteins related to the corresponding eigenvectors. The number of connections for each protein, and its rank order based on the number of connections are shown in the last two columns.
# of neighbors
A critical question remains: are these clusters formed solely according to the number of interacting proteins (i.e. spectral clustering)? Or does the function of proteins influence clustering (i.e. functional clustering)? The data we provide in this paper support the functional clustering hypothesis.
The spectral and functional nature of clusters may not be exclusive: their detailed nature could drive evolution in such a manner that the function of the protein is influenced not only by its functional type, but also by the number of protein neighbors in the whole network in order to create some vital control mechanisms to support cellular fitness. We will explore the presence of functional clustering in the following examples.
The significant proteins in eigencluster #23, their number of connections in the protein-protein interaction network, their corresponding eigenvalues, and GO molecular function annotations.
Number of connections
GO Molecular Function annotation
protein serine/threonine kinase activity
1,3-beta-glucan synthase activity
ATP binding; actin binding; structural constituent of cytoskeleton
molecular function unknown
cyclin-dependent protein kinase activity
cytoskeletal protein binding
The significant proteins in eigencluster #67, their number of connections in the protein-protein interaction network, their corresponding eigenvalues, and GO molecular function annotations.
Number of connections
GO Molecular Function Annotations
molecular function unknown
protein carrier activity
protein kinase CK2 activity
DNA-directed RNA polymerase activity
casein kinase activity
protein phosphatase type 1 activity
molecular function unknown
tRNA-intron endonuclease activity
There is also a question as to whether CSM3 may in fact be functionally disconnected from the other proteins in this cluster, as suggested in the GRID database. Is it possible to functionally relate CSM3 to other proteins in the cluster? The function of CSM3 is currently unknown according to the GRID database; however, it is known that the protein participates in meiotic chromosome segregation and DNA replication. We cannot reach a definite conclusion as to whether CSM3 interacts with any other member of the cluster. The confirmation of these putative interactions must rely on future experimental studies, but the present analysis may be useful in suggesting this specific possibility out of the millions of others.
The significant proteins in eigencluster #4850, their number of connections in the protein-protein interaction network, their corresponding eigenvalues, and GO molecular function annotations.
Number of connections
GO Molecular Function Annotations
orotate phosphoribosyltransferase activity
molecular function unknown
structural constituent of ribosome
transferase activity, transferring phosphorus-containing groups
GO assignments of biological processes and molecular functions for examples of individual eigenclusters with FunSpec. The number of proteins with the same GO annotation is in parenthesis, and the numbers in square brackets are the p-values for the assignments.
Number of significant proteins
GO Biological Process
GO Molecular Function
DNA metabolism (10) [3 × 10^-8], chromosome organization and biogenesis (7) [1 × 10^-7], nuclear organization and biogenesis (7) [4 × 10^-7], M phase (7) [3 × 10^-6], cell organization and biogenesis (11) [4 × 10^-6]
Double-stranded DNA binding (2) [2 × 10^-4], single-stranded DNA binding (2) [4 × 10^-4], DNA helicase (2) [1 × 10^-3], DNA binding (5) [3 × 10^-3], Binding (8) [5 × 10^-3]
Cell growth and maintenance (50) [9 × 10^-8], RNA metabolism (14) [2 × 10^-7], RNA processing (13) [5 × 10^-7], microtubule-based process (8) [8 × 10^-7], mRNA processing (9) [9 × 10^-7], nucleobase, nucleoside, nucleotide, and nucleic acid metabolism (24) [2 × 10^-6]
Binding (28) [1 × 10^-8], nucleic acid binding (21) [2 × 10^-7], RNA binding (13) [2 × 10^-7], mRNA binding (6) [5 × 10^-5]
Nucleobase, nucleoside, nucleotide, and nucleic acid metabolism (36) [1 × 10^-14], RNA processing (19) [1 × 10^-13], mRNA processing (14) [4 × 10^-13], RNA metabolism (19) [1 × 10^-12], RNA splicing (12) [6 × 10^-11], mRNA splicing (11) [1 × 10^-10], metabolism (44) [2 × 10^-10], cell growth and/or maintenance (49) [7 × 10^-10]
Binding (32) [8 × 10^-13], nucleic acid binding (25) [2 × 10^-11], RNA binding (14) [7 × 10^-9], mRNA binding (6) [4 × 10^-5]
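P-values of the kind listed in the table above are commonly obtained from a hypergeometric test: given N proteins in total, K of which carry a given GO annotation, the p-value is the probability that a random cluster of n proteins contains at least k annotated ones. The sketch below assumes this test applies here; it is not taken from the FunSpec source, and every count except the network size of 4906 is a hypothetical placeholder.

```python
from math import comb

def hypergeom_pvalue(k, n, K, N):
    """Probability of at least k annotated proteins in a random cluster of size n.

    N: total number of proteins, K: proteins carrying the annotation,
    n: cluster size, k: annotated proteins observed in the cluster.
    """
    upper = min(n, K)
    total = comb(N, n)
    # math.comb returns 0 when the lower index exceeds the upper one,
    # so impossible combinations contribute nothing to the sum.
    favorable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favorable / total

# Hypothetical counts: a 20-protein cluster with 7 members in a category of 150 proteins.
p_value = hypergeom_pvalue(k=7, n=20, K=150, N=4906)
print(f"p-value: {p_value:.2e}")
```

Smaller p-values indicate that the shared annotation is less likely to arise from a random selection of proteins of the same cluster size.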
Another interesting aspect of this eigencluster shown in Fig. 8 is related to the unconnected proteins: GO annotations show that the unconnected NUM1 (YDR150W), BEM2 (YER155C), and TOP1 (YOL006C) take part not only in cell organization and biogenesis, but also in cell cycle defects. Specifically, NUM1 and TOP1 are two of only 12 proteins assigned to nuclear migration. Therefore, although the presence of interactions is not experimentally verified for these three proteins, they relate closely in function to other proteins in their roles in cell organization. Experimental data indicate that SGS1 can be essential in the absence of TOP1, so possibly these two proteins may substitute functionally for one another, thereby suggesting the additional interactions shown by dotted lines. Further experimentation is needed to test possible interactions of the unconnected KRE11 with other proteins in this eigencluster.
In this eigenvector, all significant proteins except TOP1 and ARP1 are connected to each other, forming a fully interactive cluster according to the older GRID yeast data. According to our functional clustering hypothesis, these two proteins should, however, be connected to the interactive module of other proteins in the cluster. Was this discrepancy due to a limitation of our clustering hypothesis or to the lack of data? The new interaction data from Krogan et al. have confirmed one of our theoretical predictions, since the protein TOP1 (YOL006C) is indeed connected to the interactive module (the new interaction is shown as a heavy line). This is a clear example of the unifying functionality in clusters, supporting our view that the number and the topology of interactions of a given protein are related to its functional role in the cellular processes and that this functionality can be exploited to predict new interactions between proteins found within the same cluster. We also expect the protein ARP1 (YHR129C) in Fig. 9 to be connected to this interaction network, a claim yet to be substantiated by future experiments. There are other cases from the newest data set that confirm the correctness of our prediction methodology (not shown). We expect that more of these predictions may be confirmed in the future, since the yeast protein interaction data set is far from complete. We should note that a recent paper from Uri Alon's group has shown that evolutionarily developed rules of biological regulation are based on error minimization. It is quite possible that functional clusters relate to the error minimization problem by allowing functionally related proteins to replace each other in multi-regulatory systems.
We have analyzed the yeast protein interaction network by building a connectivity matrix and by applying singular value decomposition to obtain eigenvectors. We have observed that significant proteins in each eigenvector not only have similar degrees, but also are most likely to interact with each other. These proteins therefore form "functional clusters", and these clusters can guide future experiments to predict new interactions. More detailed interpretations of these networks can be obtained by further studies utilizing information about protein structures. Our method can be especially useful for larger, more complex organisms where collection of the protein interaction data is more complicated. Our results encourage further analyses to confirm that functional clusters detected by our method reflect the modular nature of protein interaction networks and originate from the evolutionary preservation of cellular fitness.
For a given square matrix A of size N × N, the eigenvalues λ_i and eigenvectors x_i (1 ≤ i ≤ N) of size N correspond to the solutions of the equation
A x = \lambda x \qquad (1)
The equation Ax = λx is a concise notation for a system of linear equations that has nontrivial solutions only if the determinant satisfies
\det(A - \lambda I_N) = 0 \qquad (2)
where I_N is the identity matrix of size N × N. This is satisfied only for certain values of λ, called eigenvalues, which are roots of the characteristic equation of A (a polynomial of degree N in λ). For each eigenvalue λ_i (1 ≤ i ≤ N) there is a corresponding eigenvector x_i that satisfies the equation A x_i = λ_i x_i. If some eigenvalues of the matrix A are zero, then the matrix A is singular, its determinant det A = 0, and the inverse matrix A^{-1} that satisfies the relation A A^{-1} = A^{-1} A = I_N does not exist. A standard mathematical approach to deal with such cases is the computation of the matrix pseudoinverse by using the singular value decomposition method, which will be discussed in the next sub-section.
Generally, any matrix A of size M × N (with M ≥ N) can be written as a product
A = U \Lambda V^{T} \qquad (3)
where Λ is the square matrix of size N × N containing non-negative values λ_1, λ_2, ..., λ_N on the diagonal and zeros off the diagonal, and U and V are matrices of sizes M × N and N × N, respectively, that have orthonormal columns, i.e. U^{T} U = I_N and V^{T} V = I_N.
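The decomposition and its use for singular matrices can be sketched with NumPy. The example below is a minimal illustration, assuming `numpy.linalg.svd` as a stand-in for the LAPACK routine used in the paper; the function name `pseudoinverse_via_svd`, the tolerance, and the 3 × 3 test matrix (a three-node chain in the connectivity-matrix convention) are hypothetical choices. The pseudoinverse is formed by inverting only the non-zero singular values.

```python
import numpy as np

def pseudoinverse_via_svd(A, tol=1e-10):
    """Moore-Penrose pseudoinverse built from the SVD, skipping zero singular values."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    s_inv = np.zeros_like(s)
    nonzero = s > tol
    s_inv[nonzero] = 1.0 / s[nonzero]  # invert only the non-zero singular values
    return Vt.T @ np.diag(s_inv) @ U.T

# Hypothetical singular matrix: node 0 is linked to nodes 1 and 2; each row sums to zero.
C = np.array([[-2.0, 1.0, 1.0],
              [1.0, -1.0, 0.0],
              [1.0, 0.0, -1.0]])

U, s, Vt = np.linalg.svd(C, full_matrices=False)
C_pinv = pseudoinverse_via_svd(C)

print("singular values:", s)
print("U^T U = I:", np.allclose(U.T @ U, np.eye(3)))
print("V^T V = I:", np.allclose(Vt @ Vt.T, np.eye(3)))
print("U diag(s) V^T = C:", np.allclose(U @ np.diag(s) @ Vt, C))
```

Note that `numpy.linalg.svd` returns the transpose of V directly (named `Vt` here), so the product in Eq. 3 is `U @ np.diag(s) @ Vt`.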
It can be shown that the original contact (connectivity) matrix C = [C_ij] for the protein network can be written as
C = U^{T} \Lambda U \qquad (4)
where Λ is the diagonal matrix containing the eigenvalues λ_1, λ_2, ..., λ_N of C, and U is the matrix whose rows are the eigenvectors of C. Thus, the elements C_ij of the contact matrix C can be expressed as
C_{ij} = \sum_{k=1}^{N} \lambda_k u_{ki} u_{kj} \qquad (5)
where u_ki denotes the i-th component of the eigenvector corresponding to the k-th eigenvalue. Equation 5 can be viewed as the eigenvalue expansion of the contact matrix. From Eq. 5 it follows for the diagonal elements, whose magnitudes are the numbers of contacts of the nodes:
C_{ii} = \sum_{k=1}^{N} \lambda_k u_{ki}^{2} \qquad (6)
The eigenvalues with the smallest indices (that correspond to the largest absolute values of λ, as seen in Fig. 1) make the largest contributions, and higher indexed eigenvalues contribute successively less (Eq. 6). We clearly see that the total number of contacts for nodes in the network (especially for those that have the highest connectivities) can be well approximated by a relatively small number of the most dominant eigenvalues, because the majority of eigenvalues shown in Fig. 1 are close to zero and do not provide any significant contributions (Eq. 6).
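The contribution of the dominant eigenvalues to the number of contacts can be checked numerically. The sketch below assumes the eigenvalue expansion takes the form C_ij = Σ_k λ_k u_ki u_kj, with the diagonal case giving the negative number of contacts of node i; it uses `numpy.linalg.eigh`, whose ascending eigenvalue order puts the most negative (largest magnitude) eigenvalues first. The function name `degree_from_modes` and the toy network (one hub with five partners) are hypothetical.

```python
import numpy as np

def degree_from_modes(C, n_modes):
    """Approximate node degrees from the n_modes eigenvalues of largest magnitude."""
    # eigh sorts eigenvalues in ascending order; all are <= 0 here,
    # so the first entries have the largest absolute values.
    eigenvalues, eigenvectors = np.linalg.eigh(C)
    lam = eigenvalues[:n_modes]
    U = eigenvectors[:, :n_modes]  # U[i, k] plays the role of u_ki in the text
    # degree_i = -C_ii, approximated by -sum_k lam_k * u_ki^2 over the kept modes
    return -(U ** 2) @ lam

# Hypothetical toy network: node 0 is a hub connected to nodes 1..5.
A = np.zeros((6, 6))
for leaf in range(1, 6):
    A[0, leaf] = A[leaf, 0] = 1.0
C = A - np.diag(A.sum(axis=1))

# Full expansion reproduces the whole matrix.
eigenvalues, eigenvectors = np.linalg.eigh(C)
reconstructed = (eigenvectors * eigenvalues) @ eigenvectors.T

one_mode = degree_from_modes(C, 1)
all_modes = degree_from_modes(C, 6)
print("degrees from 1 mode: ", np.round(one_mode, 3))
print("degrees from 6 modes:", np.round(all_modes, 3))
```

In this example a single dominant mode already recovers the degree of the hub, while the low-degree nodes need the remaining modes. Because every term in the sum is non-negative, a truncated expansion can only underestimate the true degree.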
The authors acknowledge the financial support provided by the NIH grants R01GM072014 and R33GM066387. The authors would also like to thank James C. Coyle for his assistance with LAPACK.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.