- Research article
- Open Access
Usefulness and limitations of dK random graph models to predict interactions and functional homogeneity in biological networks under a pseudo-likelihood parameter estimation approach
© Wang et al; licensee BioMed Central Ltd. 2009
Received: 9 February 2009
Accepted: 3 September 2009
Published: 3 September 2009
Many aspects of biological functions can be modeled by biological networks, such as protein interaction networks, metabolic networks, and gene coexpression networks. Studying the statistical properties of these networks in turn allows us to infer biological function. Complex statistical network models can potentially more accurately describe the networks, but it is not clear whether such complex models are better suited to find biologically meaningful subnetworks.
Recent studies have shown that the degree distribution of the nodes is not an adequate statistic in many molecular networks. We sought to extend this statistic with 2nd and 3rd order degree correlations and developed a pseudo-likelihood approach to estimate the parameters. The approach was used to analyze the MIPS and BIOGRID yeast protein interaction networks, and two yeast coexpression networks. We showed that 2nd order degree correlation information gave better predictions of gene interactions in both protein interaction and gene coexpression networks. However, in the biologically important task of predicting functionally homogeneous modules, degree correlation information performs marginally better in the case of the MIPS and BIOGRID protein interaction networks, but worse in the case of gene coexpression networks.
Our use of dK models showed that incorporation of degree correlations could increase predictive power in some contexts, albeit sometimes marginally, but, in all contexts, the use of third-order degree correlations decreased accuracy. However, it is possible that other parameter estimation methods, such as maximum likelihood, will show the usefulness of incorporating 2nd and 3rd degree correlations in predicting functionally homogeneous modules.
High throughput technologies such as microarrays and yeast-two-hybrid assays have resulted in an explosion of biological data that can be represented as networks. For example, microarray datasets can be analyzed as a coexpression network, in which nodes (or vertices) represent genes and links (or edges) represent coexpression, the similarity of the level of expression of two genes over the samples in the study. Similarly, protein interaction data, such as that generated by yeast-two-hybrid assays, can be summarized as a network, with nodes representing proteins and edges representing physical interaction between two proteins.
Genes and their products give rise to biological function through their interaction with each other and with other components of the cell. The analysis of the above biological networks is therefore the natural way to understand cellular function on a genome-wide level. In particular, we need a thorough understanding of the statistical properties of biological networks if we aim to make inferences, such as inferring evolutionary relationships between various networks, or separating signal from noise in imperfect network data.
Erdős and Rényi  were the first to study the statistical properties of random graph models. In their models (now known as ER models), any edge between two vertices occurred independently of other edges with a constant probability p. In these graphs, however, the degree of a vertex (the number of links to other vertices) is a random variable with an approximately Poisson distribution with λ = (n - 1)p, which is grossly at odds with most biological network observed to date [2, 3].
In real biological data, node degrees usually have heavy tail distributions [2, 3]. Accordingly, in most statistical studies of biological networks, the null model is a random graph from the set having a degree distribution identical to that of the data, or a distribution in which the expected degrees are identical to those observed in the data .
These models are themselves limited, because in addition to their degree distributions, biological networks show highly clustered connections  and transitivity . Indeed, it is difficult to assess which properties of a network would represent sufficient statistics that are biologically meaningful.
Mahadevan et al.  attempted to solve this problem by devising an increasing series of random network models they referred as the dK-series. The distributions of the random networks are defined as uniform over the set of graphs having the same distribution of d-sized subgraphs as the observed network data. Particular cases of the series reduce to familiar distributions: the 0K distribution P0 is identical to the corresponding ER distribution, which describes the average number of links per node. The 1K distribution model tells us the expected degree of each node and assumes that the nodes are randomly connected conditional on the expected node degrees. The 2K distribution P2 describes the interconnectivity of nodes with given degrees, maintaining the number (m(k, k')) of links between nodes of degrees k and k'. The 2K distribution therefore preserves degree-degree correlations between nodes (known as the assortativity of the network). Including still more connectivity information, the 3K distribution considers degree correlations among any 3 nodes, which include the transitivity of the network. Moving beyond pairs of nodes, various topological structures are possible. For example, there are 8 different kinds of isomorphic structures for the 3K distribution. Increasingly larger subgraphs can be enumerated for d = 4,5,..., capturing increasingly complex features of a particular graph.
The dK-series is therefore an objective way to progressively include more features into a random graph model, just as each term in a Fourier or Taylor series progressively captures more details of a given function, and thus largely avoids the arbitrary selection of statistics that may or may not be sufficient or relevant to a particular process.
Using this series as our starting point, we sought to evaluate the use of ever more inclusive dK distributions in the study of biological networks. For four different biological networks, we trained dK models for d = 0, 1, 2, 3. We first explored the properties of the models by evaluating their ability to predict in the observed networks the presence or absence of individual edges, as well as general network statistics. We showed that the 2K model outperforms other models in predicting the presence and absence of the edges for both protein interaction and gene coexpression networks.
We then evaluated whether statistical significance against one of the models for subnetworks corresponded to biological significance. We modeled our approach based on the scoring scheme used by Tanay et al. [8, 9]: they devised a pseudo-likelihood score for edges in a bipartite graph of genes and samples, in which edges occur according to a null model that corresponds to the 1K distribution, or to an alternative model, representing biological significance, independently with a high constant probability p = 0.9. They showed that this score results in improved accuracy in predicting functional gene groups, when compared with network density alone (which is equivalent to using 0K as the null model).
We reproduced their score using as the null distribution one of dK models for d = 0, 1, 2, 3. We aimed to test the hypothesis that more inclusive distributions would result in a score for a set of nodes that is more indicative of biological significance, just as 1K was in the case of bipartite graphs. We were surprised, however, to find that accuracy was only slightly increased with each successive dK distribution in the case of yeast protein interaction networks, while the 0K distribution (equivalent to edge density) had the best predictive power in coexpression networks.
Results and Discussion
In this section, we first give a brief discussion of the dK models and the pseudo-likelihood methods for estimating the parameters. Next we study the accuracy of the dK models in predicting edges in the molecular networks. Then several statistics related to the networks are studied to evaluate if the random networks can approximate the observed networks. Finally, we evaluate if the dK models can be used to identify functionally homogeneous modules.
Model description and parameter estimation
For each network, we created random graph models matching the 0K, 1K, 2K, and 3K distributions of the observed network. For the 0K and 1K models, we took the degree sequence and number of edges as fixed properties of the network and thus defined the matching models. In the 0K model, each edge occurs independently with probability , |E| being the number of edges and |V| the number of nodes in the real network. In the 1K model, each edge occurs independently, conditional on the degrees k1 and k2 of its incident nodes, with probability p(k1, k2) = min (k1k2/(2·|E|), 1 - ε), for ε small (in our case, 10-4). For the 2K model, we calculated the probability p(k1, k2) that two proteins with degree pairs (k1, k2) interact. One intuitive approach is to estimate p(k1, k2) by the fraction of interacting protein pairs among all the protein pairs with degrees (k1, k2), k1 ≤ k2. However, for many degree pairs (k1, k2), the number of such protein pairs is small. Thus, the estimated value of p(k1, k2) using this intuitive approach is not reliable. To overcome this problem, we modeled p(k1, k2) as a function of (k1, k2) and fitted the function using Matlab. Details of the estimation method are given in the "Methods" section.
The performance of predicting protein interactions using the dK distribution models
Comparing statistical features of random networks from the dK models with that of the observed networks
We next studied if the random networks based on the dK distribution models approximate the observed interaction networks. To achieve this objective, we generated 100 random networks based on the dK distribution models and calculated several statistical features of the random networks (see "Methods" for details). We studied five network statistical features as in :
λ1: average of the smallest eigenvalue of the Laplacian of the graph matrix;
λn-1: average of the largest eigenvalue of the Laplacian of the graph matrix;
d: average shortest distance between the nodes;
σ d : standard deviation of shortest distance between the nodes;
r: average assortativity coefficients.
The elements of the Laplacian matrix of a network are defined by if node i with degree k i and node j with degree k j are connected and l ij = 0 otherwise for i ≠ j, and l ij = 1 if i = j. Several other important network statistics , e.g. network resilience and performance, are tightly controlled by the smallest non-zero (λ1) and the largest (λn-1) eigenvalues of the Laplacian matrix. Therefore, we studied whether the corresponding eigenvalues of the dK random networks are close to that of the true network. In addition, the distribution of the shortest distances between any two nodes provides information on how the nodes cluster together in the network. We used two quantities, the mean and standard deviation of the shortest distances, to characterize this distribution. Finally, the assortativity coefficient of a network provides information on how nodes of different degrees link to each other. Although these five network statistical measures cannot fully describe the network of interest, they capture important network properties. If the dK distribution models can approximate the true network well, these quantities in the dK random networks should be close to the corresponding values of the true network.
Comparison of five network features for the dK distribution models with that of the MIPS protein interaction network, d = 1, 2, 3.
Comparison of five network features for the dK distribution models with that of the GDS1013 coexpression network with PCC cut-off threshold of 0.89, d = 1, 2, 3.
The performance of the dK distribution models for the identification of functionally homogeneous modules
Our primary motivation of this study is to see if the more complex models, which can generally more accurately describe the observed network, are helpful in the identification of biologically functionally homogeneous modules. Statistical deviations from a suitable model would indicate evolutionary pressure and thus functional significance. Therefore, we can compare the functional relevance of each model by how well statistical deviations from the model correlate with the functional homogeneity of the corresponding nodes.
We designed scores from our models based on an alternative hypothesis that edges are present in a functional module with constant probability p. Generally p should be close to 1 as most functionally homogeneous modules are highly clustered. As in [8, 9], we chose p = 0.9 in the main text. To see the validity of our results for different values of p, we also changed p to p = 0.85 and p = 0.95. Our approach is similar to that used by Tanay et al [8, 9], which they used a single null model (a graph is chosen at random from the set of networks having identical degree sequence to the original network, equivalent to our 1K model) in the context of a bipartite graph. With this score framework, we used a simulated annealing algorithm to find groups of genes with high scores, retaining every group encountered during the run of the algorithm and their scores under each of the null models.
Finally, we called a gene group functionally homogeneous if it was enriched in at least one functional category from the Gene Ontology . We defined module enrichment by the hypergeometric test p-value, with a threshold of p < 10-5. These gene groups were taken to be true positives, and the remaining gene groups were taken to be true negatives. Again, we varied this threshold from 0.01 to 10-6, and no qualitative changes in the results were observed, showing that our approach is robust to the parameter for calling functional homogeneity (data not shown). We then evaluated the four models by comparing how well they can predict functional homogeneity in the MIPS and BIOGRID yeast protein interaction networks, or in two different yeast gene coexpression networks, GDS1013 and GDS1103.
Results based on the MIPS protein interaction network
Spearman correlation between the scores of the gene groups for different dK distribution models based on MIPS protein interaction data.
p = 0.9, gs = 10
p = 0.9, gs = 8
p = 0.85, gs = 10
p = 0.95, gs = 10
Results based on gene coexpression networks
Spearman correlation between the scores of the gene groups for different dK distribution models based on the GDS1013 coexpression network with PCC cut-off threshold of 0.89, d = 1, 2, 3.
p = 0.9, gs = 10
p = 0.9, gs = 8
p = 0.85, gs = 10
p = 0.95, gs = 10
We also performed the same analysis for another gene expression dataset, GDS1103 . The performance results for this dataset are presented in Additional file 1: Figures S19-S26 and Additional file 2: Tables S7-S8. It should be noted that none of the dK models performed well in identifying functionally homogeneous modules based on this gene expression data set. One potential reason is that the number of sampling points is only 11, which is much smaller than that of GDS1013, which has 24 sampling points. Thus the network constructed based on GDS1103 may not be reliable. Despite the drawbacks of this dataset, the conclusions from this dataset is qualitatively identical to those found for GDS1013. This demonstrates the generality of our conclusions with respect to gene coexpression networks.
We studied the ability of dK distribution models to predict individual edges and functionally homogeneous modules in protein interaction and gene coexpression networks. A pseudo-likelihood logistic estimation method was proposed to estimate the parameters in the dK distribution models. We found that the 2K distribution model performs the best in predicting individual edges in both protein interaction and gene coexpression networks. A pseudo-likelihood ratio score function was then defined to evaluate potential functional homogeneity based on the dK distribution models. For yeast protein interaction networks, 1K, 2K and 3K models perform similarly and are slightly better than the 0K model in predicting functionally homogeneous modules. The dK scores were very highly correlated for different d. This means that, between two different subgraph topologies, the variation in the denominator, the dK distribution likelihood, was small relative to that in the numerator, the constant-p likelihood. In this case, most of the variation in scores between modules would be accounted for by the numerator. The different probabilities between 1K, 2K, and 3K may be similar overall in the networks we studied. For gene coexpression networks, the 0K model performs significantly better than the other models in predicting functionally homogeneous modules. We noted that 0K, or density, performed remarkably well as a prediction method even in the yeast protein interaction network, being able to find extremely functionally homogeneous groups of genes (p < 10-5). This may simply reflect that highly dense subnetworks in a protein interaction network represent protein complexes, which are of necessity functionally homogeneous.
One future avenue of research could be to remove this type of functionally homogeneous modules from the data, since they are relatively uninteresting examples of functional homogeneity. It may be that the subtle differences between the various dK distributions are useful to pick out homogeneous modules of more specific functions.
We downloaded yeast protein interaction data from two different data sources: MIPS  and BIOGRID . The MIPS (Munich Information Center for Protein Sequences) dataset (version: PPI_18052006.tab) contains 12,319 protein physical interactions involving 4,546 proteins. The BIOGRID dataset (version 2.0.51) contains 91,364 protein physical interactions involving 5,563 proteins.
We also studied two gene expression datasets GDS1013 and GDS1103 downloaded from the NCBI Gene Expression Omnibus . The GDS1013 expression data contains the expression profiles of about 6400 yeast genes and open reading frames by over-expressing the essential ribosomal protein activator IFH1. Twenty four samples of 2 growth protocols, 2 strains, and 5 time points were studied. Two genes are referred as linked if the Pearson correlation coefficient between the expression levels of every pair of genes is at least 0.89. The GDS1103 expression data studied the gene expression profiles of 6400 genes of leu3 mutant grown in either limited ethanol or limited ammonium media. Twelve samples involving 2 genotypes and 2 growth protocols were studied. To study the effect of different thresholds for coexpression in defining the networks, we used two threshold values 0.89 and 0.93 for the PCC between the gene expression profiles. In the main text, we provide our results based on the MIPS protein interaction data and the GDS1013 co-expression network with PCC cut-off threshold of 0.89. The results for the other networks are presented in the additional files.
Model fitting for the dK models
In this paper, we choose ϵ = 10-4.
In the 2K model, we parameterized as follows the probability
Based on the observed interaction network, we used logistic regression in Matlab to estimate the parameters (α, β, γ).
where p i = P(topo(X) = i). We estimated the parameters by maximizing the pseudo-likelihood of the data. The pseudo-likelihood Q = Q(α1,⋯,α7; β7,⋯,β7) α i ∈ R, β i ∈ R3, 1 ≤ i ≤ 7 is defined by multiplying the probability of the observed categories across all the triplets.
Calculate 2K distribution from 3K distribution
When we evaluate the ability of the 3K distribution model for predicting protein interactions in the next subsection, we need an equation linking the probability for two nodes to be connected based on the probabilities of the seven topologies given in Figure 1. The equation is given as follows.
Evaluation of the dK models for predicting protein interactions
We studied the relationship between the accuracy and the cut-off threshold for the predicted probability of interactions, between the false positive rate and the true rate (the ROC curve), and between precision and recall.
Random network simulation with dK distribution models
We generated 100 random networks based on the dK distribution models using the estimated parameters to see if the random networks approximate the observed network well. For each network, we calculate five statistics: the smallest eigenvalue of the Laplacian of the graph matrix, the largest eigenvalue of the Laplacian of the graph matrix, the shortest distance between the nodes, the standard deviation of shortest distance between the nodes, and the average assortativity coefficients as in . These statistics give approximate description of the networks of interest. If the random networks approximate the true network well, these statistics should be close to the corresponding values of the true networks. The simulation steps were carried out as follows.
where n l (i, j, k) is the number of occurrences of topology l (Figure 1) between triples of nodes with degrees i, j and k in the observed network, ñ l (i, j, k) is the predicted number of occurrences of topology l between triples of nodes with degrees i, j and k, and l (i, j, k) is the corresponding observed number in the randomized network. The detailed simulated annealing procedure is as follows.
Each state is a network. We find an initial state by rewiring the original network 10,000 times (preserving only the degree distribution). We then continue rewiring, but now we accept only resulting networks with lower energy scores, or with higher energy scores with probability , where d = 2 or 3, and T is the temperature, which we decrease as T k = α Tk-1, with T0 = 1 and α = 0.995. We ran the simulated annealing for 50,000 iterations.
Evaluation of model performance in identifying functionally homogeneous modules
Having defined a score function, we searched for modules of constant size and high score using a simulated annealing approach.
Evaluation by functional homogeneity prediction
Given a set of gene modules (groups of genes) and their score in a network obtained by each of the four models, we measured model performance as follows. We tested the genes for enrichment in one or more functions in the "biological process" category of the Gene Ontology . If the gene module showed a hypergeometric test p-value of less than 10-5 (as we previously mentioned, the exact value is not critical to the results), we declared it "functionally homogeneous". This gave us True Positive and True Negative sets. We then tested how well a particular score could predict these categories by comparing Accuracy, ROC and Precision-Recall curves for each model.
Simulated annealing search for high-scoring modules
We used the simulated annealing technique, described by Kirkpatrick , with the following definitions: A state is a subset of nodes from the network of fixed size n. The state space is therefore the set of all n-sized subsets of nodes from the full set of nodes of the network. The energy of a state is the negative of the pseudo-log-likelihood score described in equation (7) for 0K-2K models and equation (9) for the 3K model. A neighboring state is a subset that differs in exactly one member.
We ran the algorithm as follows:
Set the initial temperature for the algorithm.
Select a random set of n nodes, S0, to be the current state.
While the temperature is less than a specified minimum temperature, perform the following steps:
This research was supported by the National Natural Science Foundation of China grant 10671110, the National Basic Research Program of China (973 Program, No. 2007CB814901)(YL) and by US NIH R21AG032743 (FS). We thank the anonymous reviewers for excellent suggestions that significantly improved the presentation of the paper.
- Erdös P, Rényi A: On Random Graphs. Publicationes Mathematicae 1959, 6: 290–7.Google Scholar
- Barabasi AL, Albert R: Emergence of scaling in random networks. Science 1999, 286(5439):509–12. 10.1126/science.286.5439.509View ArticlePubMedGoogle Scholar
- Barabasi AL, Bonabeau E: Scale-free networks. Sci Am 2003, 288(5):60–9.View ArticlePubMedGoogle Scholar
- Chung F, Lu L: Connected components in random graphs with given expected degree sequences. Annals of Combinatorics 2002, 6: 125–45. 10.1007/PL00012580View ArticleGoogle Scholar
- Watts DJ, Strogatz SH: Collective dynamics of 'small-world' networks. Nature 1998, 393(6684):440–2. 10.1038/30918View ArticlePubMedGoogle Scholar
- Milo R, Itzkovitz S, Kashtan N, Levitt R, Shen-Orr S, Ayzenshtat I, Sheffer M, Alon U: Superfamilies of evolved and designed networks. Science 2004, 303(5663):1538–42. 10.1126/science.1089167View ArticlePubMedGoogle Scholar
- Mahadevan P, Krioukov D, Fall K, Vahdat A: Systematic Topology Analysis and Generation Using Degree Correlations. SIGCOMM 2006, 36: 135–46. 10.1145/1151659.1159930View ArticleGoogle Scholar
- Tanay A, Sharan R, Shamir R: Discovering statistically significant biclusters in gene expression data. Bioinformatics 2002, 18(Suppl 1):S136–44.View ArticlePubMedGoogle Scholar
- Tanay A, Sharan R, Kupiec M, Shamir R: Revealing modularity and organization in the yeast molecular network by integrated analysis of highly heterogeneous genomewide data. Proc Natl Acad Sci USA 2004, 101(9):2981–6. 10.1073/pnas.0308661100PubMed CentralView ArticlePubMedGoogle Scholar
- Mewes HW, Dietmann S, Frishman D, Gregory R, Mannhaupt G, Mayer KF, Munsterkotter M, Ruepp A, Spannagl M, Stumpflen V, Rattei T: MIPS: analysis and annotation of genome information in 2007. Nucleic Acids Res 2008, (36 Database):D196–201.Google Scholar
- Barrett T, Troup D, Wilhite SE, Ledoux P, Rudnev D, Evangelista C, Kim IF, Soboleva A, Tomashevsky M, R E: NCBI GEO: mining tens of millions of expression profiles-database and tools update. Nucleic Acids Res 2006, 35: D760–5. 10.1093/nar/gkl887PubMed CentralView ArticlePubMedGoogle Scholar
- Stark C, Breitkreutz BJ, Reguly T, Boucher L, Breitkreutz A, Tyers M: Biogrid: A General Repository for Interaction Datasets. Nucleic Acids Res 2006, 34: D535–9. 10.1093/nar/gkj109PubMed CentralView ArticlePubMedGoogle Scholar
- Consortium TGO: Gene Ontology: tool for the unification of biology. Nat Genet 2000, 25: 25–9. 10.1038/75556View ArticleGoogle Scholar
- Kirkpatrick S, Gelatt CD, Vecchi MP: Optimization by Simulated Annealing. Science 1983, 220(4598):671–80. 10.1126/science.220.4598.671View ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.