Generating confidence intervals on biological networks
© Thorne and Stumpf; licensee BioMed Central Ltd. 2007
Received: 12 June 2006
Accepted: 30 November 2007
Published: 30 November 2007
In the analysis of networks we frequently require the statistical significance of some network statistic, such as measures of similarity for the properties of interacting nodes. The structure of the network may introduce dependencies among the nodes and it will in general be necessary to account for these dependencies in the statistical analysis. To this end we require some form of Null model of the network: generally rewired replicates of the network are generated which preserve only the degree (number of interactions) of each node. We show that this can fail to capture important features of network structure, and may result in unrealistic significance levels, when potentially confounding additional information is available.
We present a new network resampling Null model which takes into account the degree sequence as well as available biological annotations. Using gene ontology information as an illustration we show how this information can be accounted for in the resampling approach, and the impact such information has on the assessment of statistical significance of correlations and motif-abundances in the Saccharomyces cerevisiae protein interaction network. An algorithm, GOcardShuffle, is introduced to allow for the efficient construction of an improved Null model for network data.
We use the protein interaction network of S. cerevisiae; correlations between the evolutionary rates and expression levels of interacting proteins and their statistical significance were assessed for Null models which condition on different aspects of the available data. The novel GOcardShuffle approach results in a Null model for annotated network data which appears better to describe the properties of real biological networks.
An improved statistical approach for the statistical analysis of biological network data, which conditions on the available biological information, leads to qualitatively different results compared to approaches which ignore such annotations. In particular we demonstrate the effects of the biological organization of the network can be sufficient to explain the observed similarity of interacting proteins.
Large-scale protein interaction network (PIN) data have now been collected in a number of prokaryotic and eukaryotic species. It has been suggested that these networks provide an integrative perspective on cellular processes and considerable effort has been invested into their functional and evolutionary analysis[1–6]. At the moment molecular network data sets are still plagued by noise  – this is especially true for protein interaction networks – and incompleteness , but nevertheless considerable progress is being made in the analysis of complex cellular phenotypes in light of such networks. Below we will introduce an novel method for the construction of confidence intervals for network quantities. This new approach is able to fuse different lines of biological information and generate conditional confidence intervals; these can be applied as an alternative we can employ it in addition to demonstrate it in an analysis of the Saccaromyces cerevisiae PIN.
A number of studies have investigated (i) whether characteristics of interacting proteins are more similar than those of proteins for which no interaction has been reported [9, 10], and (ii) how the network structure affects properties – such as the evolutionary rate [6, 11–13] – of interacting genes or proteins. Other studies have looked at structural properties, such as network motifs [14–16]. Because of the dependence introduced by the network it is, however, not possible to use the conventional confidence measures, e.g. for the correlation coefficient of some property of pairs of interacting proteins. Rather, a network-aware Null model has to be used which compares the actual network with some suitably randomized version of it. In order to incorporate network aspects these studies have used either (i) straightforward bootstrapping of nodes in order to create random pairs of nodes (such as proteins) [9, 10], (ii) bootstrapped nodes based on their degree , or (iii) randomly rewired the network while keeping the degree of each node fixed [14, 15, 17] (see Methods section for details). The first approach has been shown to underestimate the size of the confidence intervals (CI) , while the second and third yield statistically similar results (CIs are also broader than for (i)) for measures of pair-wise similarity of the properties of interacting nodes. In order to assess the CIs for motifs, however, an explicit incorporation of the network is generally necessary and only the third approach can be used.
All of these three approaches above rely, however, implicitly on the assumption that the network is uniform and not structured in any particular way. Such procedures also ignore any other information that is increasingly becoming available for many species [18–20], and which may affect the organization of the network. While bootstrap (as long as the degree sequence is accounted for either exactly or statistically) or rewiring approaches are parsimonious – and undoubtedly should be preferred for general cases where no other information is available – it opens up the question as to whether such approaches are still satisfactory when additional, potentially co-variate, data is available.
Here we provide statistical tools for incorporating such additional information into the statistical analysis. Using such information can have considerable effect on the construction of network confidence intervals, and our procedure, GOcardShuffle, constructs randomly rewired instances of networks that incorporate the degree sequence exactly, and additional information statistically (based on a rejection-sampling algorithm). Thus, for example, if in a given protein interaction dataset, proteins in the mitochondria interact predominantly with other mitochondrial proteins but not at all with proteins in the endoplasmic reticulum, then GOcardShuffle will construct only instances of randomly rewired networks which reflect the relative importance of intra-category versus inter-category connections. In addition to GO annotations, any other biological annotation (e.g Enzyme Commission numbers or protein domain information) may act as confounding variables, e.g. when expression levels differ between categories.
There is a rich statistical literature on confounding variables and their role in the statistical interpretation of primary effects. Scenarios, where the effects of known or unknown confounding variables result in inconsistencies unless properly accounted for, are known as examples of Simpson's paradox in statistics. On a much more subtle scale there will undoubtedly be confounding variables in many of the processes studied in systems biology. These, at least in principle, can be accounted for in a framework such as GOcardShuffle; if the approach (implemented in Python and R) is used in addition to random rewiring then it may be possible to detect such potentially confounding hidden variables.
Here we illustrate the use of GOcardShuffle by contrasting statistical confidence intervals obtained under different Null models for network rewiring.
Correlation of node properties
In both cases, however, we notice that the full annotation as used in GOcardShuffle results in distributions of correlation values that cover the observed value of the correlation. Thus, once the rewired network instances are conditioned on GO annotations the observed correlation appear to be covered by the new, conditioned Null model. In Figure 1 in Additional File 1 we show that the effects of conditioning on presently available functional information in the context of presently available protein interaction data does result in a shift of the distribution obtained under the Null model away from zero to finite positive values. Depending on the dataset and correlation measure, however, the GOcardShuffle histogram may not overlap the observed value (see Additional File 1).
Quite generally, we expect that conditioning such analyses on additional available data (of which increasing amounts are becoming available) will result in a shift in the expected Null distribution if such data does explain some aspects of the variability in the measures to be correlated. That is, we observe the shift in the Null distributions, precisely because some of the variation in evolutionary rate and mRNA expression levels are captured by GO annotations [6, 21].
We have shown that it is possible to condition the rewiring process by which confidence intervals on networks are constructed on biological information such as gene ontology data. Integrating such known biological information into the statistical analysis of protein interaction network data may result in changes to the Null model if such data is correlated with network organization. We demonstrated the effect of conditioning on GO data by analyzing the correlations among interacting proteins: several studies had reported that properties of interacting proteins are significantly more similar than those of non-interacting proteins. Applying GOcardShuffle to yeast PIN data and conditioning on different combinations of GO categories suggests that this may at least partially be because the protein interaction networks of real biological organisms are inhomogeneous and show a level of local and functional organization, which has been ignored in previous statistical analyses. In light of the conditional Null models, however, the similarity of evolutionary rates and expression levels of interacting proteins in the Yeast PIN dataset used here, is just as would be expected for a network with the same biological characteristics (as captured by present biological annotations). Since these protein characteristics differ between different categories  – even if sometimes only slightly – and since within-category interactions are more frequent than between category interactions, similarity of properties of interacting proteins are readily understood.
Presently GO annotations have to be treated with some care and caution. There is the danger of circular arguments if in-silico annotations (which often rely on protein-protein interactions) are used. As we outline in the Methods section uncertainties and different levels of support for different annotations are straightforwardly incorporated into the GOcardShuffle algorithm.
The source code of the Python and R routines of the implementation of GOcardShuffle are available from the authors' website .
Our novel network resampling approach allows the construction of confidence intervals under a statistical Null model of network organization which conditions on available biological information. If used in addition to conventional rewiring procedures then this approach can be used to detect potentially confounding hidden variables or relationships in systems biology data.
GOcardShuffle allows the refinement of the statistical Null model for network structure based on available biological data: the rewired network instances may now capture probabilistically the modular aspects of these molecular networks (if the annotations imply such a structure). This appears to be the case for GO annotations of yeast proteins, and as we have shown, such stratification of the network – where within-category interactions are more frequent than between-category interactions – may lead to correlations among properties of interacting proteins. Once this has been accounted for, there is no strong additional evidence for interacting proteins to be more similar than would be expected by chance. The present approach is readily extended to include other information on functional and structural properties of the network. Quite generally, GOcardShuffle, can be applied in the statistical analysis of coloured graph problems.
Constructing confidence intervals for networks
Given a reported network dataset (which will at present generally be plagued by false-positive and false-negative results [7, 30], as well as incompleteness ) we wish to be able to evaluate the statistical significance of some network statistic. To this end we need to construct networks which share some characteristics of the observed network; as we have shown above, the choice of the information we choose to use to generate such rewired networks can have a pronounced effect on the results of the statistical analysis.
Previous approaches: unconditional procedures
Depending on whether the similarity of properties of interacting proteins or the abundance of network motifs were considered, previous approaches assessed statistical significance either through a bootstrap or randomization procedure, or by rewiring the network. In the former authors either picked M pairs of interacting proteins by randomly sampling 2M proteins with replacement from the N proteins in the dataset , shuffled the list of interaction partners , or picked proteins proportional to their degree . The latter two approaches conserve the degree sequence exactly and probabilistically, respectively; for the first approach, on the other hand, it is straightforward to show, that this corresponds to making the assumption that the Null model is a classical or Erdös-Rényi random graph (and is therefore inappropriate for the analysis of real networks). The sample statistic (such as a correlation coefficient) is then calculated for each replicate to generate the distribution under the Null model.
Rewiring of the network involves breaking up all interactions and leaving a number of "stubs" at each node corresponding to its degree. Randomly chosen pairs of stubs are then connected until all M interactions have been created and the summary statistic (correlation coefficient or number of motifs) is calculated. Repeating this process a sufficient number of times again results in the expected distribution under the Null model. Furthermore a Markov Chain Monte Carlo approach can be constructed which, e.g. conditions the network on the number of observed triangles . Such an approach is in practice, however, computationally expensive and does not appear to have been used widely . In the meantime, however, elegant analytic approaches have been developed which allow the statistical assessment of network motif exceptionality .
Conditional rewiring: GOcardShuffle
as all other edges in and ' remain unaffected by the proposed change.
The configuration remains unchanged with probability 1 - p, whence a new configuration change will be proposed.
Thus GOcardShuffle – because of the general properties of MCMC [34, 35] – will result in a Markov chain which has as its stationary distribution the ensemble of networks (defined by Pr(ω|)) which condition on the degree sequence (by virtue of fixing the degree of each node) and on the weight matrix ω (by construction of the chain).
As in all MCMC approaches it is important to run the algorithm for a sufficiently long period to remove dependence on the initial configuration and to reach the stationary distribution of the Markov Process (the burn-in period). After that the chain produces highly correlated configurations so configurations are sampled only after a sufficiently large number of steps in the chain (this is referred to as the thinning-out interval) [35, 36]. Choice of the length for burn-in and thinning-out intervals require experimentation and/or fine-tuning. In GOcardShuffle the default parameter for the burn-in period is 100 × M steps, while the thinning-out interval has a length of 10 × M steps.
In the discussion we have thus far assumed that each protein has only one annotation. Two additional factors are straightforwardly included in GOcardShuffle:
Multiple annotations: For many proteins we have more than one annotation. This can be due to a protein being found in more than one cellular component; being involved in more than one biological process; or having more than one molecular function; or any combination of the above.
Multiple annotation categories: Above we have chosen to group proteins together if they have identical annotations. Thus ν x is the number of proteins with the same annotation x; this means that they all have the same annotation regarding function, process and component. If each category has 30 annotations then we need to consider 27,000 unique annotations and approximately 3.6 × 108 different combinations x, y ∈ γ, most of which will be zero.
This assumes that annotations x1 and x2 are equally important in describing the biological characteristics of protein x. If, for example, x1 is more relevant then we would have to replace Eqn. (10) by . In most cases, however, present information will not be sufficient to introduce reliable weightings of multiple annotations for each protein.
With Eqn. (11) the normalization of the edge probabilities ωx,yis trivially maintained. Multiple annotations are therefore straightforwardly and parsimoniously dealt with. Once annotations become very reliable and detailed it will, however, be possible to introduce weightings on different annotations. Alternatively to Eqn. (11) we may determine that a combination of annotations x = (x1, x2,...,x α ) defines a new annotation. This could be advantageous if proteins that have more than one function, i.e. annotation x ' tend to interact predominantly with proteins that have a different annotation x" (or set of annotations x "). Clearly in such an event the simple ansatz given by Eqn. (11) may give rise to interactions among proteins that would never interact in real life. Combining annotations into a new single annotation is possible by preprocessing the annotation input-data prior to using GOcardShuffle. Given the present state of the data (both PIN and annotations) we believe that using the approach given by Eqn. (11) puts less emphasis on potentially erroneous data; in the future, however, it will be possible to go beyond this approach by considering dependencies among sets of annotations.
Dealing with the potentially very large number of different annotations requires more careful consideration. In addition to the computational challenges of dealing with very large matrices, ω = (ω xy ), taking annotations as "true" could be problematic as it may severely limit the size of the network ensemble that is defined through the stationary distribution of the Markov Chain defined by GOcardShuffle as most entries in ω will be zero. An additional problem is that GO annotation is only approximate and when protein-interaction data has been used to annotate proteins in silico errors in either the interaction data or GO annotations may be propagated. A pragmatic if approximate solution is to divide the annotations into the three different categories: molecular function; biological process; and cellular component. We thus define 3 different matrices, one for each category
ω k for k = 1, 2, 3 or k = F, P, C
(or any combination of pairs of annotations, = ω i ⊗ ω j for i, ≠ j ∈ 1, 2, 3). In Eqn. (12), ⊗ denotes the standard tensor product  of the weight matrices. This has also numerical and computational advantages as we only have to store three (or two) small matrices (typically we use approximately 30 annotations per category) rather than one very large matrix. Eqn. (12) will for real networks only be approximate if the different GO categories are themselves correlated (which we know to be the case for yeast and other organisms for which extensive GO annotation data has been assembled) and it will be necessary to test whether this approximation is reasonable (in the data presented here we found acceptably small differences between the true and approximate weights). Nevertheless, even if only the approximation is used, any systematic differences between classical rewiring approaches and the network instances created by GOcardShuffle will highlight confounding factors which ought to be included in the construction of network confidence measures.
The GOcardShuffle algorithm can be summarized as follows
Generate set of stubs from true network
while free stubs do
Choose two stubs uniformly from those remaining and create an edge between them
for i = 0 to λ do
Choose two edges a and b in current network at random, uniformly
Calculate p using Eqn. (6).
Generate random value 0 <r < 1
if r <p or p > 1 then
Cross over a and b in network
The chain is sampled at intervals of λ1 steps, after a burn-in period of λ0 steps. For GOcardShuffle the default values are
λ0 = 100 × M and λ1 = 10 × M
(M is again the number of edges in the network). If L conditionally rewired network instances are required then λ = 100 × M + 10 × M × L.
where n is the number of times the motif is found in the true network, ⟨n⟩ is the average number of times the same motif is found in the rewired networks, and σ n is the standard deviation of motif counts in the rewired network. reduce the amount combinations that need to be considered, only nodes within a path of length 1 from the current node are considered for the choice of the second node, and nodes within a path of length 1 from the first or second node for the third node.
The methods described above were implemented in Python, as well as for the R statistical environment  (computationally intensive routines were implemented in C); R was also used for all statistical analyses. The source code for the GOcardShuffle algorithm is available from our website .
TT would like to thank the Wellcome Trust for providing a PhD studentship. C. Wiuf, W. Kelly, I. Holmquist, M. de Iorio, S. Dobbins, and, especially, S. Richardson are thanked for helpful discussions. MPHS gratefully acknowledges financial support from the Wellcome Trust and EMBO through a Young Investigator Fellowship.
- Tucker C, Gera J, Uetz P: Towards an understanding of complex protein networks. Trends Cell Biol. 2001, 11: 102-106. 10.1016/S0962-8924(00)01902-4.View ArticlePubMedGoogle Scholar
- Gavin M, Bosche M, Krause R, Grandi P, Marzioch M, Schultz J, Rick J, Michon A, Cruciat C, Remor M, Hofert C, Schelder M, Brajenovic M, Ruffner H, Merino A, Hudak M, Dickson D, Rudi T, Ganu V, Bauch A, Bastuck S, Huhse B, Leutwein C, Heurtier M, Copley R, Edelmann A, Querfurth E, V R, Drewes G, Raida M, Bouwmeester T, Bork P, Seraphin B, Kuster B, Neubauer G, G SF: Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature. 2002, 415: 141-147. 10.1038/415141a.View ArticlePubMedGoogle Scholar
- Luscombe N, Babu M, Yu H, Snyder M, Teichmann S, Gerstein M: Genomic analysis of regulatory network dynamics reveals large topological change. Nature. 2004, 431: 308-312. 10.1038/nature02782.View ArticlePubMedGoogle Scholar
- Yu H, Greenbaum D, Lu HX, Zhu X, Gerstein M: Genomic analysis of essentiality within protein networks. Trends Genet. 2004, 20 (6): 227-31. 10.1016/j.tig.2004.04.008.View ArticlePubMedGoogle Scholar
- Bork P, Jensen LJ, von Mering C, Ramani AK, Lee I, Marcotte EM: Protein interaction networks from yeast to human. Curr Opin Struct Biol. 2004, 14 (3): 292-9. 10.1016/j.sbi.2004.05.003.View ArticlePubMedGoogle Scholar
- Agrafioti I, Swire J, Abbott I, Huntley D, Butcher S, Stumpf M: Comparative analysis of the Saccaromyces cerevisiae and Caenorhabditis elegans protein interaction networks. BMC Evolutionary Biology. 2005, 5: 23-10.1186/1471-2148-5-23.PubMed CentralView ArticlePubMedGoogle Scholar
- Bader JS, Chaudhuri A, Rothberg JM, Chant J: Gaining confidence in high-throughput protein interaction networks. Nat Biotechnol. 2004, 22: 78-85. 10.1038/nbt924.View ArticlePubMedGoogle Scholar
- Stumpf M, Wiuf C, May R: Subnets of scale-free networks are not scale-free: the sampling properties of networks. Proc Natl Acad Sci USA. 2005, 102: 4221-4224. 10.1073/pnas.0501179102.PubMed CentralView ArticlePubMedGoogle Scholar
- Fraser HB, Hirsh AE, Steinmetz LM, Scharfe C, Feldman MW: Evolutionary rate in the protein interaction network. Science. 2002, 296 (5568): 750-2. 10.1126/science.1068696.View ArticlePubMedGoogle Scholar
- Lemos B, Meiklejohn C, Hartl D: Regulatory evolution across the protein interaction network. Nat Genet. 2004, 36 (10): 1059-60. 10.1038/ng1427.View ArticlePubMedGoogle Scholar
- Wagner A: The yeast protein interaction network evolves rapidly and contains few redundant duplicate genes. Mol Biol Evol. 2001, 18 (7): 1283-1292.View ArticlePubMedGoogle Scholar
- Jordan IK, Wolf YI, Koonin EV: No simple dependence between protein evolution rate and the number of protein-protein interactions: only the most prolific interactors tend to evolve slowly. BMC Evol Biol. 2003, 3: 1-10.1186/1471-2148-3-1.PubMed CentralView ArticlePubMedGoogle Scholar
- Hahn MW, Conant GC, Wagner A: Molecular evolution in large genetic networks: does connectivity equal constraint?. J Mol Evol. 2004, 58 (2): 203-11. 10.1007/s00239-003-2544-0.View ArticlePubMedGoogle Scholar
- Milo R, Shen-Orr S, Itzkovitz S, Kashtan N, Chklovskii D, Alon U: Network motifs: Simple building blocks of complex networks. Science. 2002, 298 (5594): 824-827. 10.1126/science.298.5594.824.View ArticlePubMedGoogle Scholar
- Milo R, Itzkovitz S, Kashtan N, Levitt R, Shen-Orr S, Ayzenshtat I, Sheffer M, Alon U: Superfamilies of evolved and designed networks. Science. 2004, 303 (5663): 1538-1542. 10.1126/science.1089167.View ArticlePubMedGoogle Scholar
- Berg J, Lässig M: Local graph alignment and motif search in biological networks. Proc Natl Acad Sci USA. 2004, 101 (41): 14689-14694. 10.1073/pnas.0305199101.PubMed CentralView ArticlePubMedGoogle Scholar
- Kashtan N, Itzkovitz S, Milo R, Alon U: Topological generalizations of network motifs. Phys Rev E Stat Nonlin Soft Matter Phys. 2004, 70 (3 Pt 1): 031909-
- Drummond D, Raval A, Wilke C: A single determinant dominates the rate of yeast protein evolution. Mol Biol Evol. 2006, 23: 327-337. 10.1093/molbev/msj038.View ArticlePubMedGoogle Scholar
- Xenarios I, Rice D, Salwinski L, Baron M, Marcotte E, Eisenberg D: Dip: the database of interacting proteins. Nucl Acid Res. 2000, 28: 289-291. 10.1093/nar/28.1.289.View ArticleGoogle Scholar
- Hermjakob H, Montecchi-Palazzi L, Bader G, Wojcik R, Salwinski L, Ceol A, Moore S, Orchard S, Sarkans U, von Mering C, Roechert B, Poux S, Jung E, Mersch H, Kersey P, Lappe M, Li Y, Zeng R, Rana D, Nikolski M, Husi H, Brun C, Shanker K, Grant S, Sander C, Bork P, Zhu W, Pandey A, Brazma A, Jacq B, Vidal M, Sherman D, Legrain P, Cesareni G, Xenarios L, Eisenberg D, Steipe B, Hogue C, Apweiler R: The HUPOPSI's Molecular Interaction format – a community standard for the representation of protein interaction data. NATURE BIOTECHNOLOGY. 2004, 22 (2): 177-183. 10.1038/nbt926.View ArticlePubMedGoogle Scholar
- Reguly T, Breitkreutz A, Boucher L, Breitkreutz B, Hon G, Myers C, Parsons A, Friesen H, Oughtred AR, amd Tong, Stark C, Ho Y, Botstein D, Andrews B, Boone C, Troyanskya O, Ideker T, Dolinski K, Batada N, Tyers M: Comprehensive curation and analysis of global interaction networks in Saccharomyces cerevisiae. J Biol. 2006, 5: 11-10.1186/jbiol36.PubMed CentralView ArticlePubMedGoogle Scholar
- FatiGO a web tool for finding significant associations of Ontology terms with groups of genes. Bioinformatics. 2004, 20: 578-580. 10.1093/bioinformatics/btg455.
- Cho R, Campbell M, Winzeler E, Steinmetz L, Conway A, Wodicka L, Wolfsberg T, Gabrielian A, Landsman D, Lockhart D, Davies R: A genome-wide transcriptional analysis of the mitotic cell cycle. Mol Cell. 1998, 2: 65-73. 10.1016/S1097-2765(00)80114-8.View ArticlePubMedGoogle Scholar
- Functional and topological characterization of protein interaction networks. Proteomics. 2004, 4 (4): 928-42. 10.1002/pmic.200300636.
- de Silva E, Thorne T, Ingram P, Agrafioti I, Swire J, Wiuf C, Stumpf M: The effects of incomplete protein interaction data on structural and evolutionary inferences. BMC Biology. 2006, 4: 39-10.1186/1741-7007-4-39.PubMed CentralView ArticlePubMedGoogle Scholar
- Picard F, Daudin JJ, Schbath S, Robin S: Assessing the exceptionality of network motifs. 2006, [Research Report], [http://genome.jouy.inra.fr/ssb/preprint/]Google Scholar
- Metropolis N, Rosenbluth A, Rosenbluth M, Teller A, Teller E: Equation of State Calculations by Fast Computing Machines. J Chem Phys. 1953, 21: 1087-1092. 10.1063/1.1699114.View ArticleGoogle Scholar
- Ripley BD: Stochastic Simulation. 1987, WileyView ArticleGoogle Scholar
- Robert C, Casella G: Monte Carlo Statistical Methods. 2004, Springer, 2View ArticleGoogle Scholar
- Newman M, Barkema G: Monte Carlo Methods in Statistical Physics. 1999, Clarendon PressGoogle Scholar
- Arfken G, Weber H: Mathematical Methods for Physicists. 2005, Academic Press, 6Google Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.