Betweenness centrality and its limitations in analyzing regulatory networks
With regard to the analysis of regulatory networks, betweenness centrality has two major disadvantages. Firstly, shortest paths are assumed to be the most important ones, which is a misleading oversimplification. The importance of a path is determined not so much by its length, i.e., the number of reactions, as by the integral efficiency of all these reactions. This efficiency depends on many factors, such as the concentrations of the participants, rate constants, etc. Longer paths can be faster and more efficient than shorter ones. For instance, in regulatory networks, the initiation of transcription and translation is typically governed by sets of specific factors. This increases the length of the corresponding paths, but drastically improves the efficiency and specificity of these processes. Similarly, scaffold and adaptor proteins, which themselves are not enzymes, recruit downstream effectors in signaling pathways and enhance both the efficiency and specificity of signal propagation. Moreover, in most regulatory networks, such as gene networks, an inherent problem is that the real length of edges is not defined at all. Each single edge commonly summarizes a set of events and describes a causal relation between genes. This kind of abstraction says nothing about the complexity and length of the corresponding processes. Thus, the inconsistent semantics of the edges render the definition of a shortest path in these networks highly problematic.
Secondly, betweenness centrality can be applied only to vertices that lie between other ones. Peripheral vertices, i.e., vertices with either zero incoming or zero outgoing degree, are not considered. This immediately excludes many extracellular ligands, receptors, target molecules and genes from the analysis of a signaling network (Figure 1). Such components, however, are directly responsible for the input-output functionality of the network and are therefore of key significance. Moreover, their individual topological significance in the network may vary over a wide range, as can be seen when comparing the connectedness of the start-points S1 and S2, or the end-points T1 and T2, in Figure 1. In terms of betweenness centrality, however, all of them are assigned a value of zero, which fails to reflect the individual connectedness of such input/output elements within the whole network.
We therefore developed the concept of the pairwise disconnectivity index as a new topological metric, which also takes alternative, possibly longer, paths into account and can be used to characterize the topological significance of all individual elements in biological regulatory networks. The approach has some similarity to numerical parameters such as vertex-connectivity or edge-connectivity, which are used in graph theory to measure a graph's connectedness [19]. However, our method does not focus on how the removal of distinct elements breaks a given connected graph into disconnected pieces, as the algorithm of Girvan and Newman [12] does, though a network's disintegration can be considered as well. Instead, our aim was to find a parameter that describes more moderate effects in a still connected network.
Topological significance of individual elements in a regulatory network
In a directed graph G(V,E) representing a regulatory network, the vertices v ∈ V denote biological entities, e.g., proteins, genes, or small molecules. Causal relationships between these entities are represented by directed edges e ∈ E. We define the topological significance of an individual element (a vertex, an edge or a combination of them) as how essential this element is for all connections in the network. To quantify this significance, we propose measuring how the elimination of such an element affects the number of connected ordered pairs of vertices. An ordered pair of vertices {i, j} | i ≠ j and i, j ∈ V, is connected iff there is at least one path from vertex i to vertex j in G. Note that the ordered pair {i, j} is different from {j, i} in a directed network. The more ordered pairs become disconnected upon the removal of vertex v, the higher the topological significance of this vertex. We define the pairwise disconnectivity index of vertex v, Dis(v), as the fraction of those initially connected pairs of vertices in a network which become disconnected if vertex v is removed from the network:
$$ \mathrm{Dis}(v) = \frac{N_0 - N_{-v}}{N_0} = 1 - \frac{N_{-v}}{N_0} \tag{1} $$
Here, N0 is the total number of ordered pairs of vertices in a network that are connected by at least one directed path of any length. It is assumed that N0 > 0, i.e., there exists at least one edge in the network that links two different vertices. N-v is the number of ordered pairs that are still connected after removing vertex v from the network, via alternative paths through other vertices (see vertex 2 in Figure 2B,C). The relation of N-v to N0 conveyed by Dis(v) thus immediately reveals the fraction of connected ordered pairs whose communication essentially depends on vertex v. In the extreme case, the removal of vertex v destroys all communication in a network, resulting in Dis(v) = 1. In contrast, Dis(v) = 0 refers to a non-crucial vertex, which is necessarily isolated: any pair that starts or ends at v would be lost upon its removal, so such a vertex is not connected to any other vertex in the network.
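The definition of Dis(v) can be illustrated with a minimal sketch that counts connected ordered pairs by breadth-first search before and after removing a vertex. The function names (`count_connected_pairs`, `dis_vertex`) and the adjacency-dictionary representation are illustrative assumptions, not part of the original method description. The three-vertex chain corresponds to Figure 2A as described in the text; the `bypass` graph is an assumed topology with a longer parallel path, chosen only for illustration.

```python
from collections import deque

def count_connected_pairs(adj, removed=frozenset()):
    """Count ordered pairs (i, j), i != j, joined by a directed path
    that avoids every vertex in `removed`."""
    total = 0
    for source in adj:
        if source in removed:
            continue
        seen = {source}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in seen and w not in removed:
                    seen.add(w)
                    queue.append(w)
        total += len(seen) - 1  # do not count the source itself
    return total

def dis_vertex(adj, v):
    """Pairwise disconnectivity index of vertex v: (N0 - N-v) / N0."""
    n0 = count_connected_pairs(adj)
    if n0 == 0:
        raise ValueError("N0 must be positive (at least one edge is required)")
    n_minus_v = count_connected_pairs(adj, removed={v})
    return (n0 - n_minus_v) / n0

# Every vertex must appear as a key, even if it has no outgoing edges.
chain = {1: [2], 2: [3], 3: []}                       # 1 -> 2 -> 3
bypass = {1: [2, 4], 2: [3], 3: [], 4: [5], 5: [3]}   # plus 1 -> 4 -> 5 -> 3 (assumed)

print(dis_vertex(chain, 2))   # 1.0: all three connected pairs depend on vertex 2
print(dis_vertex(bypass, 2))  # 0.25: only the pairs {1,2} and {2,3} are lost
```

This brute-force approach runs one graph traversal per vertex for each evaluated removal, which is adequate for toy networks; it is not meant to reflect how the indices were computed for the networks analyzed here.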
The example presented in Figure 2 also illustrates the difference between the pairwise disconnectivity index and betweenness centrality. Vertex 2 is characterized by equally high (case A) or equally low (case B) values of both measures, whereas they differ greatly in case C (high betweenness centrality, but low pairwise disconnectivity index). The toy network in Figure 3 further illustrates that betweenness centrality and the pairwise disconnectivity index reflect different properties of a vertex in a network. While vertices 4 and 7 mediate most of the shortest paths and therefore exhibit a very high betweenness centrality, they show a rather low pairwise disconnectivity index, since each provides an alternative path for the other. In contrast, vertex 1 displays modest betweenness centrality but has a high topological significance according to its disconnectivity value (Figure 3). Thus, a vertex with high betweenness is not necessarily topologically significant according to its disconnectivity value. High betweenness only indicates the fraction of shortest communication paths between reachable vertices that run through a particular vertex.
Furthermore, the difference between the pairwise disconnectivity index and betweenness centrality becomes apparent when taking a closer look into the kind of reachable ordered pairs whose connection depends on vertex v. The complete set of those pairs, N0 - N-v, may include those which are connected by 1) paths that end at vertex v, 2) paths that start at vertex v, and 3) paths that go through vertex v. Other pairs cannot be affected, since they are connected via paths that do not contain any of the edges around vertex v. Accordingly, the pairwise disconnectivity index of vertex v can be represented as follows
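$$ \mathrm{Dis}(v) = \frac{N_0 - N_{-v}}{N_0} = \frac{\sigma_{sv} + \sigma_{vt} + \sigma_{st}(v)}{N_0} \tag{2} $$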
The term σ_st(v) in Eq. 2 expresses the number of ordered pairs {s,t} | s ≠ t ≠ v and s, t, v ∈ V that are exclusively linked through vertex v. Both σ_sv and σ_vt involve v and represent the path-degree of vertex v in terms of all incoming and outgoing paths, respectively. Altogether, σ_st(v) is not a trivial combination of σ_sv and σ_vt, as Figure 2 shows: vertex 2 is indeed crucial for connecting vertex 1 to vertex 3 in graph 2A. But in graphs 2B and 2C the same connection 1 → 3 no longer depends on vertex 2, because of the parallel paths. However, vertex 2 is still essential for all paths that start or end at this vertex. The numbers of such ordered pairs associated with vertex 2, σ_s2 and σ_2t, do not change across graphs 2A, 2B and 2C, thereby indicating the absence of a simple relationship between the values of σ_sv, σ_vt and σ_st(v).
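The three kinds of affected pairs (paths ending at v, starting at v, and passing through v) can be separated with a short sketch. The helper names (`reachable_from`, `connected_pairs`, `decompose`) are hypothetical and chosen for illustration; the `bypass` graph is an assumed topology with a parallel path, not a reproduction of Figure 2.

```python
from collections import deque

def reachable_from(adj, source, removed=frozenset()):
    """Vertices reachable from `source` (excluding it) while avoiding `removed`."""
    seen = {source}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in seen and w not in removed:
                seen.add(w)
                queue.append(w)
    seen.discard(source)
    return seen

def connected_pairs(adj, removed=frozenset()):
    """Set of connected ordered pairs (s, t) in the graph without `removed`."""
    return {(s, t)
            for s in adj if s not in removed
            for t in reachable_from(adj, s, removed)}

def decompose(adj, v):
    """Return (sigma_sv, sigma_vt, sigma_st(v), N0) for vertex v."""
    before = connected_pairs(adj)
    lost = before - connected_pairs(adj, removed={v})
    sigma_sv = sum(1 for s, t in lost if t == v)             # paths ending at v
    sigma_vt = sum(1 for s, t in lost if s == v)             # paths starting at v
    sigma_st = sum(1 for s, t in lost if v not in (s, t))    # paths through v only
    return sigma_sv, sigma_vt, sigma_st, len(before)

chain = {1: [2], 2: [3], 3: []}
bypass = {1: [2, 4], 2: [3], 3: [], 4: [5], 5: [3]}

print(decompose(chain, 2))   # (1, 1, 1, 3)
print(decompose(bypass, 2))  # (1, 1, 0, 8)
```

In both toy graphs σ_sv and σ_vt are identical for vertex 2, while σ_st(v) drops from 1 to 0 once a parallel path exists, which mirrors the argument made for Figure 2.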
Often one wants to know how many connected pairs {i,j} depend on a particular vertex v while disregarding the pairs that involve the vertex itself, i.e., considering only pairs where v ≠ i and v ≠ j. For example, when analyzing the role of a receptor in the indirect communication of extracellular ligands with transcription factors, communication paths that start or end at the receptor need not be considered. The term σ_st(v) in Eq. 2 captures exactly this sort of essentiality, and we define
$$ \mathrm{MDis}(v) = \frac{\sigma_{st}(v)}{N_0} \tag{3} $$
as the mediative disconnectivity index of a vertex v. It immediately gives the fraction of connected ordered pairs of vertices different from v for whose reachability vertex v is necessary. While the pairwise disconnectivity index of vertex 2 in Figure 2A involves the pairs {1,2}, {1,3} and {2,3}, its mediative disconnectivity index reveals that vertex 2 uniquely bridges the connection for {1,3}.
The mediative disconnectivity index of a vertex may exhibit some similarity to the betweenness centrality of the vertex. A path that uniquely connects two vertices i and j and is destroyed after removing another vertex v is always the shortest path between i and j. However, betweenness centrality considers all shortest paths, whereas MDis(v) uncovers only the cases where vertex v is the sole link for a connected pair i and j. The principal difference between these parameters lies in their different sensitivity to parallel paths: betweenness centrality is insensitive to the presence of longer bypasses, whereas MDis(v) is very sensitive to them.
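The different sensitivity to longer bypasses can be checked on a toy example. The sketch below computes unnormalized betweenness centrality with Brandes' algorithm for unweighted directed graphs (a standard technique, used here as a stand-in since the text does not specify how B(v) was computed) and compares it with MDis(v). All function names and both toy graphs are illustrative assumptions.

```python
from collections import deque

def betweenness(adj):
    """Unnormalized betweenness centrality (Brandes' algorithm, directed, unweighted)."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        stack, queue = [], deque([s])
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}   # number of shortest paths from s
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        while queue:
            u = queue.popleft()
            stack.append(u)
            for w in adj[u]:
                if dist[w] < 0:               # first visit
                    dist[w] = dist[u] + 1
                    queue.append(w)
                if dist[w] == dist[u] + 1:    # u lies on a shortest path to w
                    sigma[w] += sigma[u]
                    preds[w].append(u)
        delta = {v: 0.0 for v in adj}
        while stack:                          # accumulate dependencies backwards
            w = stack.pop()
            for u in preds[w]:
                delta[u] += sigma[u] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return bc

def connected_pairs(adj, removed=frozenset()):
    """Set of connected ordered pairs (s, t) avoiding the vertices in `removed`."""
    pairs = set()
    for s in adj:
        if s in removed:
            continue
        seen, queue = {s}, deque([s])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in seen and w not in removed:
                    seen.add(w)
                    queue.append(w)
        pairs.update((s, t) for t in seen if t != s)
    return pairs

def mdis(adj, v):
    """Mediative disconnectivity index: lost pairs not involving v, divided by N0."""
    before = connected_pairs(adj)
    lost = before - connected_pairs(adj, removed={v})
    return sum(1 for s, t in lost if v not in (s, t)) / len(before)

chain = {1: [2], 2: [3], 3: []}
bypass = {1: [2, 4], 2: [3], 3: [], 4: [5], 5: [3]}  # adds the longer path 1 -> 4 -> 5 -> 3

for name, g in (("chain", chain), ("bypass", bypass)):
    print(name, betweenness(g)[2], mdis(g, 2))
```

In this example the betweenness of vertex 2 stays at 1.0 in both graphs, because the bypass is longer than the path through vertex 2, whereas MDis(2) falls from 1/3 to 0 as soon as the bypass exists.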
Vertex removal is a strong interference in a network because it simultaneously removes all incoming and outgoing edges of that vertex. One can also perturb a network by selectively knocking out a particular edge. This is a relatively gentle intervention which can simulate various normal and pathological situations in a regulatory network when all components are still present, but due to a mutation in one of them some of its reactions are specifically disabled while others are still working. That is particularly important when considering the fact that edges are a kind of abstraction and simplification, as discussed above. Thus, we declare an edge as topologically significant in the same way as a vertex: The higher the number of ordered pairs that become disconnected the higher the topological significance of an eliminated edge. To quantify this, we introduce the pairwise disconnectivity index of an edge, Dis(e), which is defined as
$$ \mathrm{Dis}(e) = \frac{N_0 - N_{-e}}{N_0} = 1 - \frac{N_{-e}}{N_0} \tag{4} $$
Again, N0 is the number of ordered pairs of vertices connected by at least one directed path in the network. N-e is the number of such pairs after removing edge e from the network. The pairwise disconnectivity index of an edge lies in the range 0 ≤ Dis(e) ≤ 1. For Figure 2A we argued above that the communication of the ordered pair {1,3} depends on vertex 2. The disconnectivity index of an edge makes clear that it is not necessary to remove vertex 2 itself in order to disconnect the pair {1,3}. Rather, disrupting either the incoming or the outgoing edge of vertex 2 is enough to achieve the same effect.
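The edge variant can be sketched in the same way as the vertex variant, by skipping a single edge during the traversal. The names `connected_pairs` and `dis_edge` are hypothetical; the chain 1 → 2 → 3 corresponds to Figure 2A as described in the text.

```python
from collections import deque

def connected_pairs(adj, removed_edges=frozenset()):
    """Set of ordered pairs (i, j), i != j, joined by a directed path
    that avoids every edge in `removed_edges`."""
    pairs = set()
    for source in adj:
        seen = {source}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if (u, w) in removed_edges or w in seen:
                    continue
                seen.add(w)
                queue.append(w)
        pairs.update((source, t) for t in seen if t != source)
    return pairs

def dis_edge(adj, edge):
    """Pairwise disconnectivity index of an edge; assumes N0 > 0 as in the text."""
    before = connected_pairs(adj)
    after = connected_pairs(adj, removed_edges={edge})
    return len(before - after) / len(before)

chain = {1: [2], 2: [3], 3: []}

# Removing either edge around vertex 2 disconnects the pair (1, 3),
# although vertex 2 itself stays in the network.
for edge in [(1, 2), (2, 3)]:
    still_connected = (1, 3) in connected_pairs(chain, removed_edges={edge})
    print(edge, round(dis_edge(chain, edge), 3), still_connected)
```

Each edge removal costs two of the three connected pairs in this chain, compared with all three when vertex 2 is removed, which reflects the gentler nature of the intervention.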
Topological significance of a group of elements in a regulatory network
Not all major functional breakdowns of a network can be attributed to the failure of a single element; some result from the dysfunction of a subset of vertices or edges. The malfunctioning of such a subset may disrupt a significant number of communication lines because parallel paths may be destroyed simultaneously. For example, in Figure 2C the ordered pair {1,3} stays connected unless vertices 2 and 4, or vertices 2 and 5, are taken out together. As a generalization of Eq. 1, we define the pairwise disconnectivity index of a group of vertices, W ⊆ V, as
$$ \mathrm{Dis}(W) = \frac{N_0 - N_{-W}}{N_0} = 1 - \frac{N_{-W}}{N_0} \tag{5} $$
with N-W representing the number of connected ordered pairs after removing the set of vertices W. Note that Dis(W) cannot be inferred directly from the disconnectivity indices of the individual vertices in W. This is due to the presence of parallel paths in a network. For example, vertex 4 (or vertex 7) in Figure 3 features a rather low pairwise disconnectivity index. But the removal of the group 'vertex 4 AND vertex 7' causes the network to split into two distinct parts.
Finally, the generalization of Eq. 4 from an individual edge to a group of edges, F ⊆ E, is given by the pairwise disconnectivity index of a group of edges, as defined in Eq. 6.
$$ \mathrm{Dis}(F) = \frac{N_0 - N_{-F}}{N_0} = 1 - \frac{N_{-F}}{N_0} \tag{6} $$
Here also, Dis(F) cannot be inferred directly from the disconnectivity indices of individual edges in F.
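That the group indices are not simply the sum of the individual ones can be seen in a small sketch. The function names and the `bypass` graph are illustrative assumptions; the graph is merely consistent with the description of Figure 2C in the text (the pair {1,3} stays connected unless vertices 2 and 4, or 2 and 5, are removed together) and is not a reproduction of the figure.

```python
from collections import deque

def count_connected_pairs(adj, removed_vertices=frozenset(), removed_edges=frozenset()):
    """Count connected ordered pairs after deleting the given vertices and edges."""
    total = 0
    for source in adj:
        if source in removed_vertices:
            continue
        seen = {source}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w in seen or w in removed_vertices or (u, w) in removed_edges:
                    continue
                seen.add(w)
                queue.append(w)
        total += len(seen) - 1
    return total

def dis_group(adj, vertices=(), edges=()):
    """Dis(W) for a vertex group, Dis(F) for an edge group; assumes N0 > 0."""
    n0 = count_connected_pairs(adj)
    n_after = count_connected_pairs(adj, frozenset(vertices), frozenset(edges))
    return (n0 - n_after) / n0

bypass = {1: [2, 4], 2: [3], 3: [], 4: [5], 5: [3]}

# Vertex group W = {2, 4}: the joint effect exceeds the sum of the single effects,
# because the pair (1, 3) is lost only when both parallel paths are cut.
print(dis_group(bypass, vertices=[2]) + dis_group(bypass, vertices=[4]))  # 0.75
print(dis_group(bypass, vertices=[2, 4]))                                 # 0.875

# Edge group F = {1->2, 1->4}: the same non-additivity holds for edges.
print(dis_group(bypass, edges=[(1, 2)]) + dis_group(bypass, edges=[(1, 4)]))  # 0.375
print(dis_group(bypass, edges=[(1, 2), (1, 4)]))                              # 0.5
```

Evaluating all subsets of a given size in this way grows combinatorially with network size, so in practice one would restrict the candidate groups; how this was handled for the networks analyzed here is not stated in this section.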
Applying the pairwise disconnectivity index to the analysis of biological regulatory networks
In a topological analysis of several biological networks (one signal transduction network, two transcription regulation networks, and a neuronal connectivity network), we comparatively evaluated the pairwise disconnectivity index of the individual vertices with their betweenness centrality.
Transcription networks are displayed here as directed graphs, in which the nodes represent transcription factor genes and edges represent regulatory relationships between them, i.e., the transcriptional regulation of another transcription factor gene. We used the two best characterized transcription regulation networks from organisms of different kingdoms: a bacterium (Escherichia coli) [20] and a unicellular eukaryote (the yeast Saccharomyces cerevisiae) [21].
The E. coli transcriptional regulatory network consists of 423 vertices and 578 edges [20]. Most vertices in this network have small values of both B(v) and Dis(v), as can be seen from the mean values of B(v) and Dis(v) (Figure 4). There is a strong positive correlation between the pairwise disconnectivity indices, Dis(v), and the corresponding values of betweenness centrality, B(v), for many genes, among them arcA, ompR_envZ, hns, rpoH, fliAZY, and flhDC. Their Dis(v) tends to be directly proportional to B(v) (Figure 4). However, we found many exceptions to this trend. These are genes that exhibit low betweenness but relatively high disconnectivity: crp, himA, fnr, rpoE_rseABC, yhdG_fis, cspA, and nipd_rpoS. Gene crp shows the highest pairwise disconnectivity index. In the network analyzed, most of these genes display both nonzero incoming degree (k_in > 0) and nonzero outgoing degree (k_out > 0) and therefore have an internal position in the network. The protein product of gene crp is a well-characterized transcription activator triggered by cAMP and is responsible for regulating the expression of more than 100 genes in E. coli [22]. Moreover, genes crp (CRP), fnr (FNR) and fis (FIS) belong to the few global transcriptional regulators which are sufficient for directly modulating the expression of 51% of all genes in E. coli [23]. Betweenness centrality fails to identify them as topologically significant.
A similar 'predictive weakness' of betweenness centrality is observed in the transcriptional network of S. cerevisiae (Figure 5). This network consists of 688 vertices and 1079 edges. Again, there is a strong positive correlation between the pairwise disconnectivity index of individual genes and the corresponding value of betweenness centrality. Such genes lie along the diagonal of the plot. Most vertices in this network have small values of both B(v) and Dis(v) and thereby exhibit low topological significance. However, many genes with B(v) = 0, like REB1, UME6, MIG1 and STE12, have high values of Dis(v) (Figure 5). In the network analyzed, all these genes exhibit no incoming degree (k_in = 0) and are therefore positioned at the periphery of the network. The relatively large value of the pairwise disconnectivity index for these genes is in accordance with the roles they play in yeast. The product of gene REB1 (RNA polymerase I enhancer binding protein) is a DNA-binding protein that recognizes sites in both the enhancer and the promoter of rRNA transcription, as well as upstream of many genes transcribed by RNA polymerase II [24]. REB1 is essential for cell growth: its deletion mutant is inviable [25]. The other three genes of this group (UME6, MIG1, STE12) have important functions too; deleting them results in altered phenotypes, but is not lethal [26–31] [see Additional file 1]. Among the genes that have equally high values of the pairwise disconnectivity index and betweenness centrality, MCM1 is vital for the yeast cell [25, 32]. Thus, at least one essential gene (REB1) was detected by the pairwise disconnectivity index, but this gene would have been missed by betweenness centrality because of its peripheral position in the network considered.
We next analyzed the neuronal connectivity network of a simple multicellular organism, the nematode Caenorhabditis elegans [33]. Here, nodes represent neurons, and edges denote synaptic connections between the neurons. Each synaptic connection propagates a nerve impulse in one direction. This regulatory network includes 252 vertices and 509 directed edges. We found the same trend as in the transcription regulatory networks mentioned above: there are many vertices that display a low betweenness centrality combined with a high pairwise disconnectivity index (Figure 6). In contrast to the pairwise disconnectivity index, betweenness centrality seems to underestimate the topological significance of some nodes, although we cannot comment here on their biological relevance since this is not documented.
The last example of regulatory networks refers to higher eukaryotes and is represented by the mammalian Toll-like receptor 4 (TLR4) signaling network. It controls a protective response of a host cell to a bacterial intervention and is important in activating the innate immunity [34, 35]. The network consists of all signaling molecules that are reachable from the TLR4 receptor or from which the TLR4 receptor is reachable according to the contents of the TRANSPATH® database on signal transduction [36]. It comprises 742 vertices (molecules) and 1952 edges (reactions) and represents a genome-wide view at a level above the individual mammalian species. The contribution of individual vertices to sustaining the integrity of these paths varies significantly, with a mean pairwise disconnectivity index of 0.0044 (Figure 7). That is, an average vertex is crucial for only 0.44% of the connected ordered pairs in the TLR4 network, which indicates the robust topological organization of the network. There are many molecules, like Myt1 (myelin transcription factor 1), Cdk1 (cyclin-dependent kinase 1), ERK2 (mitogen-activated protein kinase 2), p53 (tumor suppressor p53) and others, whose disconnectivity potential significantly exceeds this average level (Figure 7). Interestingly, all of them exhibit a lethal knockout effect in mice [see Additional file 1]. The pairwise disconnectivity index of vertices positively correlates with the corresponding values of betweenness centrality. In contrast to the transcriptional regulatory networks from E. coli and S. cerevisiae and the neuron connectivity network from C. elegans (Figures 4, 5, 6), the mammalian TLR4 network does not have vertices which exhibit both low B(v) and high Dis(v) values. Moreover, the relationship between the pairwise disconnectivity index and betweenness centrality in this network is much more scattered. The larger B(v) and Dis(v), the broader the scattering.
Thus, there are many molecules which do not differ in their B(v) value, but significantly differ in their Dis(v) values and vice versa. Molecules Abl and PDK1 display the highest levels of B(v), but they are moderate in terms of Dis(v). That is, Abl and PDK1 are highly engaged in shortest-path communication in the network, but there are longer paths able to sustain the communication if either Abl or PDK1 is absent. In contrast to that, molecules Myt1, Cdk1 and ERK2 show the highest values of Dis(v), but they are moderate in terms of B(v) which means that although these proteins are not the most significant mediators of shortest-path communication in the TLR4 network they nevertheless provide the biggest impact on the topology of the network. Altogether, all these examples demonstrate that Dis(v) and B(v) represent different aspects of network organization.
In order to determine the most significant vertices that convey the communication between others, we calculated the mediative disconnectivity indices of all vertices, MDis(v), in the above-mentioned networks and plotted them versus the corresponding values of betweenness centrality. The transcriptional networks from E. coli and S. cerevisiae and the neuron connectivity network from C. elegans show an almost ideal linear interdependence of MDis(v) and B(v), characterized by the correlation coefficients 0.99, 0.99 and 1.0, respectively [see Additional files 2, 3 and 4]. The corresponding mean values of MDis(v) are very small: 0.0008, 0.0006, and 0.004, respectively. Therefore, only a small fraction of vertices are crucial as mediators of communication in these networks. Taken together, these networks, according to the present state of knowledge, appear to avoid significant parallelism of their paths and are relatively simply organized. In sharp contrast, the relationship of MDis(v) and B(v) in the mammalian TLR4 network is very scattered (Figure 8) and comparable with that of Dis(v) and B(v) (Figure 7). This network exhibits a higher complexity than the previous ones. In this case, again, MDis(v) and B(v) characterize different aspects of network organization.