Generating confidence intervals on biological networks

Background In the analysis of networks we frequently require the statistical significance of some network statistic, such as measures of similarity for the properties of interacting nodes. The structure of the network may introduce dependencies among the nodes and it will in general be necessary to account for these dependencies in the statistical analysis. To this end we require some form of Null model of the network: generally rewired replicates of the network are generated which preserve only the degree (number of interactions) of each node. We show that this can fail to capture important features of network structure, and may result in unrealistic significance levels, when potentially confounding additional information is available.
Methods We present a new network resampling Null model which takes into account the degree sequence as well as available biological annotations. Using gene ontology information as an illustration we show how this information can be accounted for in the resampling approach, and the impact such information has on the assessment of statistical significance of correlations and motif-abundances in the Saccharomyces cerevisiae protein interaction network. An algorithm, GOcardShuffle, is introduced to allow for the efficient construction of an improved Null model for network data.
Results We use the protein interaction network of S. cerevisiae; correlations between the evolutionary rates and expression levels of interacting proteins and their statistical significance were assessed for Null models which condition on different aspects of the available data. The novel GOcardShuffle approach results in a Null model for annotated network data which appears better to describe the properties of real biological networks.
Conclusion An improved approach to the statistical analysis of biological network data, which conditions on the available biological information, leads to qualitatively different results compared to approaches which ignore such annotations. In particular we demonstrate that the effects of the biological organization of the network can be sufficient to explain the observed similarity of interacting proteins.


Effect of Dataset choice on GOCardShuffle results
The poor quality of different protein interaction datasets has attracted great interest in the literature (see e.g. [2,5,9]). We have applied the GOCardShuffle algorithm to a range of datasets (see figure 1). In each case the confidence intervals obtained with and without conditioning on the available GO annotation differ significantly. The GOCardShuffle confidence intervals for Kendall's τ rank correlation coefficient overlap the observed correlation in each network dataset, with two exceptions: for the literature-curated data of Reguly et al. [5] (denoted by LC in figure 1) and the DIP CORE dataset [1,2,10], the expression levels of interacting proteins are more similar than would be expected by chance. We note, however, that at this level the choice of correlation measure matters as well as the choice of dataset: for a given dataset, the GOCardShuffle Null distribution for one correlation coefficient may overlap the observed network statistic while that for a different correlation coefficient does not. In such a case the evidence for similarity of the properties of interacting proteins would probably have to be considered marginal.
In summary, however, this shows that there is a need for statistical methods that condition on the available data.
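The comparison between an observed edge correlation and a Null confidence interval can be sketched in a few lines of Python. This is a minimal illustration and not the GOCardShuffle implementation: it uses plain degree-preserving edge swaps (the unconditional Null model), Kendall's tau-a without tie correction, and a toy network. All function names, the toy data, and the choice of ten attempted swaps per edge are assumptions made for the example.

```python
import random
from itertools import combinations

def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    n = len(x)
    s = 0
    for i, j in combinations(range(n), 2):
        prod = (x[i] - x[j]) * (y[i] - y[j])
        s += (prod > 0) - (prod < 0)
    return s / (n * (n - 1) / 2)

def edge_tau(edges, value):
    """Rank correlation of a node property across the two ends of each edge."""
    # Use every edge in both orientations so the result does not depend
    # on the arbitrary order in which the endpoints are stored.
    x = [value[u] for u, v in edges] + [value[v] for u, v in edges]
    y = [value[v] for u, v in edges] + [value[u] for u, v in edges]
    return kendall_tau_a(x, y)

def rewire(edges, n_swaps, rng):
    """Unconditional rewiring by degree-preserving double edge swaps."""
    edges = [tuple(e) for e in edges]
    present = {frozenset(e) for e in edges}
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(edges)), 2)
        (a, b), (c, d) = edges[i], edges[j]
        if rng.random() < 0.5:
            c, d = d, c  # pick one of the two possible re-pairings
        if a == d or c == b:
            continue  # would create a self-loop
        new1, new2 = frozenset((a, d)), frozenset((c, b))
        if new1 in present or new2 in present or new1 == new2:
            continue  # would create a duplicate edge
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {new1, new2}
        edges[i], edges[j] = (a, d), (c, b)
    return edges

def null_interval(edges, value, n_networks=200, level=0.95, seed=0):
    """Empirical interval of the statistic over rewired networks."""
    rng = random.Random(seed)
    taus = sorted(
        edge_tau(rewire(edges, 10 * len(edges), rng), value)
        for _ in range(n_networks)
    )
    k = int((1 - level) / 2 * n_networks)
    return taus[k], taus[-k - 1]

# Toy network (hypothetical): a ring of 8 nodes with two chords.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7), (7, 0),
         (0, 2), (4, 6)]
value = {n: float(n) for n in range(8)}  # stand-in for e.g. expression level

observed = edge_tau(edges, value)
lo, hi = null_interval(edges, value)
print("observed tau:", observed)
print("Null interval:", (lo, hi))
print("observed inside interval:", lo <= observed <= hi)
```

A Null model that conditions on annotation would replace `rewire` with a sampler that also respects the category connection pattern; the interval construction itself would stay the same.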

Illustration of GOCardShuffle
The algorithm is outlined in the manuscript and implemented in the accompanying software (and in the NetZ package; www.imperial.ac.uk/theoreticalgenomics/data-software).
In figure 2 we show the distribution of the number of edges connecting nodes with certain functional categories, resulting from 400 conventionally rewired networks (black histograms) and from 400 networks where rewiring was conditioned on GO functions using GOCardShuffle. Although this is only a small selection of within- and between-category edges, these are representative of the remaining cases. In each case we find that the histogram obtained using GOCardShuffle overlaps the observed value. For some instances, notably for between-category edges where one protein has no known annotation, unconditional rewiring also results in a distribution which covers the observed number of between-category edges in the original network; in other instances it does not. In particular, unconditional rewiring underestimates the relative prevalence of within-category connections [3,7].
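The edge numbers compared in this way can be tabulated with a short helper. The sketch below assumes each protein carries a single category label and that the network is given as a list of undirected edges. The function names, protein identifiers and Null counts are hypothetical and serve only as an illustration.

```python
from collections import Counter

def category_edge_counts(edges, category):
    """Count edges for each unordered pair of categories.

    edges    -- iterable of (u, v) node pairs (undirected)
    category -- dict mapping each node to a single category label
    """
    counts = Counter()
    for u, v in edges:
        # Sort the two labels so that (i, j) and (j, i) share one key.
        pair = tuple(sorted((category[u], category[v])))
        counts[pair] += 1
    return counts

def covers(observed, null_counts):
    """True if the observed count lies within the range of the Null counts."""
    return min(null_counts) <= observed <= max(null_counts)

# Toy data (hypothetical proteins and annotations).
edges = [("p1", "p2"), ("p2", "p3"), ("p3", "p4"), ("p1", "p4"), ("p4", "p5")]
category = {
    "p1": "unknown",
    "p2": "RNA binding",
    "p3": "RNA binding",
    "p4": "protein binding",
    "p5": "unknown",
}

observed = category_edge_counts(edges, category)
for pair, m in sorted(observed.items()):
    print(pair, m)

# Compare one observed count with counts from (made-up) rewired networks.
print(covers(observed[("protein binding", "unknown")], [0, 1, 3]))
```

Applying `category_edge_counts` to each rewired network and collecting the values for one category pair gives the histogram for that panel; `covers` is the crude range check of whether that histogram includes the observed value.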
For a single annotation matrix ω, the number of edges between proteins belonging to categories i and j, m_ij, has under unconditional resampling an expectation value determined by M, the total number of edges in the network, N_i, the number of nodes/proteins belonging to category i, and N, the total number of nodes in the network; its variance is given by the standard multinomial variance. Since edges are sampled uniformly, the Markov chain generated by Eqns. (5) to (8) in the manuscript will converge to the desired stationary distribution for the within- and between-category connection pattern [6,8].
Figure 2: Number of edges observed in the real network connecting nodes with specific functional GO categories (red line) and histograms of these numbers obtained from 400 independent network instances using unconditional rewiring (white) and conditional rewiring using GOcardShuffle (black), respectively. The different panels show edges connecting proteins belonging to the following functional categories: (a) "protein-function unknown" - "RNA binding"; (b) "protein-function unknown" - "protein binding"; (c) "protein-function unknown" - "protein-function unknown"; (d) "protein-function unknown" - "RNA binding"; (e) "RNA binding" - "protein binding"; (f) "transcription regulator activity" - "RNA binding".
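As an illustration of how the expectation and multinomial variance of m_ij come about, consider the simplifying assumption that each of the M edges is placed independently and uniformly at random among all node pairs, ignoring the degree sequence. This is a sketch under that assumption, and the expressions used in the manuscript may differ. The probability p_ij that an edge joins categories i and j is a symbol introduced here.

```latex
p_{ij} =
\begin{cases}
  \dfrac{N_i N_j}{\binom{N}{2}}, & i \neq j, \\[2ex]
  \dfrac{\binom{N_i}{2}}{\binom{N}{2}}, & i = j,
\end{cases}
\qquad
\mathbb{E}[m_{ij}] = M\, p_{ij},
\qquad
\operatorname{Var}[m_{ij}] = M\, p_{ij}\,(1 - p_{ij}).
```

Under this assumption the expected number of within-category edges grows only with the square of the category size, so any tendency of real networks to connect proteins of the same function shows up as an excess over M p_ii.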

Evolutionary Rate
The variability of the number of edges between and within categories can be assessed analytically only for regular random graphs (i.e. graphs where each node has the same fixed degree but edges are distributed at random), as there is a delicate interplay between (i) the variance induced by the Metropolis sampler and (ii) the variation induced by heterogeneous node degrees. The results shown in figure 2, together with the observation that the histograms for within- and between-category edge numbers always include the observed value, give us confidence that GOCardShuffle does indeed preserve the hallmarks of the original network data.