Gene network interconnectedness and the generalized topological overlap measure

Background Network methods are increasingly used to represent the interactions of genes and/or proteins. Genes or proteins that are directly linked may have a similar biological function or may be part of the same biological pathway. Since the information on the connection (adjacency) between 2 nodes may be noisy or incomplete, it can be desirable to consider alternative measures of pairwise interconnectedness. Here we study a class of measures that are proportional to the number of neighbors that a pair of nodes share in common. For example, the topological overlap measure by Ravasz et al. [1] can be interpreted as a measure of agreement between the m = 1 step neighborhoods of 2 nodes. Several studies have shown that two proteins having a higher topological overlap are more likely to belong to the same functional class than proteins having a lower topological overlap. Here we address the question whether a measure of topological overlap based on higher-order neighborhoods could give rise to a more robust and sensitive measure of interconnectedness. Results We generalize the topological overlap measure from m = 1 step neighborhoods to m ≥ 2 step neighborhoods. This allows us to define the m-th order generalized topological overlap measure (GTOM) by (i) counting the number of m-step neighbors that a pair of nodes share and (ii) normalizing it to take a value between 0 and 1. Using theoretical arguments, a yeast co-expression network application, and a fly protein network application, we illustrate the usefulness of the proposed measure for module detection and gene neighborhood analysis. Conclusion Topological overlap can serve as an important filter to counter the effects of spurious or missing connections between network nodes. The m-th order topological overlap measure allows one to trade-off sensitivity versus specificity when it comes to defining pairwise interconnectedness and network modules.


Background
We consider undirected, unweighted biological networks that can be represented by a symmetric adjacency matrix A = [a ij ]. The adjacency a ij between nodes i and j equals 1 if the nodes are connected and 0 otherwise. For notational convenience, we set the diagonal elements to 1. While the adjacency matrix considers each pair of genes in isolation, topological overlap considers each pair of genes in relation to all other genes in the network. More specifically, genes are said to have high topological overlap if they are connected to roughly the same group of genes in the network (i.e. they share the same neighborhood). To calculate the topological overlap for a pair of genes, their connections with all other genes in the network are compared. If the 2 nodes connect to the same group of other nodes, then they have a high 'topological overlap'. Here we study the properties of the topological overlap measure (TOM) and propose a generalization that enriches TOM's sensitivity to longer ranging connections between nodes.
There is empirical evidence that two substrates having a higher overlap are more likely to belong to the same functional class than substrates having a lower topological overlap [1][2][3][4][5]. Such a finding prompts the question whether a measure of topological overlap based on higher-order neighborhoods would lead to a more sensitive and robust measure of interconnectedness. In this paper, we generalize the topological overlap measure by incorporating information from higher-order neighborhoods and show that it leads to a definition of larger modules. Specifically, the m-th order topological overlap measure is constructed by (i) counting the number of mstep neighbors that a pair of nodes share and (ii) normalizing it to take a value between 0 and 1. The resulting node similarity measure is a measure of agreement between the m-step neighborhoods of 2 input nodes. Such a measure can be applied in a number of ways, for instance, ranking the genes, similarity search, prediction based on k-nearest neighbors, multi-dimensional scaling and module identification by clustering.

Results
The algebraic definition of the topological overlap measure can be found in Eq. (7) in the Methods section. Here we provide a more intuitive set theoretic interpretation of the topological overlap measure. Our generalization of the TOM is motivated by the observation that Eq. (7) can be expressed as where N 1 (i) denotes the set of direct neighbors of i excluding i itself and |·| denotes the number of elements (cardinality) in its argument. The quantity |N 1 (i) ∩ N 1 (j)| measures the number of common neighbors that nodes i and j share whereas |N 1 (i)| gives the number of neighbors of i. The topological overlap t ij assumes a minimal value of 0 if there is no direct linkage between the two nodes and if they share no common direct neighbors. It assumes a maximum value of 1 if there is a direct link between the two nodes and if one set of direct neighbors is a subset of the other. The fact that t ij is bounded between 0 and 1 is used in the definition of the topological overlap based dissimilarity measure (see Eq. 4).
Generalizing TOM to m-step neighborhoods By denoting N m (i) (with m > 0) the set of nodes (excluding i itself) that are reachable from i within a path of length m, i.e., where dist(i, j) is the geodesic distance (shortest path distance) between i and j, we obtain a very natural generalization of the TOM, which reads as follows

Predicting essential proteins in a Drosophila protein network
Knock-out experiments in lower organisms (e.g. yeast, fly, worm) have shown that essential proteins tend to be more highly connected than non-essential proteins [7][8][9]. To illustrate the biological usefulness of the proposed interconnectedness measures, we set out to test the ability of GTOM to predict essential proteins using a Drosophila (fly) protein-protein interaction network from the BioGrid Database [10]. In our version of the database, the largest connected component was comprised of 2294 proteins with 21217 pairwise interactions. Knock-out experiments had implicated 282 essential proteins.
To illustrate the use of GTOMm, we will show that proteins that are highly interconnected with an essential protein have an increased chance of being essential as well.
Toward this end, we make use of the following terminology. A GTOMm neighborhood of size S around node i is defined as the set (i) of S genes with highest GTOM value with i. For example, node j is in the size S = 1 GTOMm neighborhood around node i if it has the highest topological overlap across all nodes. For simplicity, we ignore ties and define (i) as the set of genes that directly link to node i (irrespective of the neighborhood size S).
Since essential proteins may participate in the same pathway, it is biologically plausible that the GTOMm neighborhoods of an initial essential protein contain essential proteins as well. Below, we show that for a fixed neighborhood size S and a fixed essential protein, the GTOM2 neighborhood contains a higher proportion of essential proteins than the corresponding GTOM0 or GTOM1 neighborhoods. This provides indirect empirical evidence that the proposed higher order GTOM measure (m = 2) outperforms the standard GTOM measure (m = 1) in this application.
Specifically, we considered the neighborhoods around each of the 30 essential proteins with highest nodal degrees (referred to as essential hub proteins). Since on average the GTOM0 neighborhoods of these essential hub proteins contain 40 proteins, we considered the following neighborhood sizes S = 1,..., 40. For each of the essential hub proteins, we determined the GTOMm neighborhood for m = 0,1,2,3. For a given neighborhood of size S, we determined the proportion of essential proteins. Next we averaged these proportions across the 30 essential hub proteins. Figure 1 reports the average proportion of essential proteins (y-axis) versus different neighborhood sizes (x-axis).
Proportions of essential proteins among the S proteins that are most highly interconnected with a given essential hub protein in the Drosophila protein-protein interaction network Figure 1 Proportions of essential proteins among the S proteins that are most highly interconnected with a given essential hub protein in the Drosophila protein-protein interaction network. For a given neighborhood size S (x-axis) we averaged the results over 30 essential hub proteins (y-axis). The black horizontal line (GTOM0) represents the average proportion of essential proteins among the directly linked neighbors (adjacency = 1) of an essential hub protein.
As can be seen from Figure 1, GTOM2 performs better than the other measures in this application involving a relatively sparse network. For example, when considering neighborhoods comprised of 10 proteins (rank 10) based on GTOM0, GTOM1, GTOM2, GTOM3 the proportion of essential proteins is given by 0.59, 0.59, 0.68, and 0.58, respectively. We find that neighborhood analysis with GTOM2 leads to significantly better results than GTOM0 (Wilcoxon p-value = 0.034), GTOM1 (p-value = 0.015) and GTOM3 p-value = 0.02).

Module detection in a yeast co-expression network
There is evidence that genes and their protein products carry out cellular processes in the context of functional modules [11]. Thus, an important task in biological network analysis is to identify groups, or 'modules', of densely interconnected genes. Here we focus on module identification methods that are based on using a node dissimilarity measure in conjunction with a clustering method. Further, we assume that the nodes in a network module have high topological overlap with their neighbors. A review of alternative module detection methods is beyond the scope of this article, see e.g. [12][13][14][15][16][17][18][19][20][21].
To demonstrate the usefulness of the GTOM dissimilarity measures for module detection, we apply the proposed measures to gene co-expression networks constructed based on a microarray dataset recording gene expression levels during different stages of cell cycle in yeast [22]. Because the transcriptional response of cells to changing conditions involves the coordinated co-expression of genes encoding interacting proteins, studying co-expression patterns can provide insights into the underlying cellular processes [23,24]. As detailed in the Methods section, the co-expression network was constructed by thresholding the absolute pair-wise (Pearson) correlation coefficient between the expression profiles. We cluster the genes into modules using the average linkage hierarchical clustering with different choices of dissimilarity measures. Modules correspond to branches of the resulting clustering tree. While there is evidence that this clustering procedure leads to biologically meaningful modules in several applications [1][2][3][4][5]25], we do not claim that this clustering method is optimal. Since our interest lies in the performance of topological overlap based dissimilarity measures but not the clustering procedure, comparing different clustering procedures is beyond the scope of this article. In our applications, modules correspond to branches of a hierarchical clustering dendrogram [6]. Figure  Genes that belong to the functional class 'protein biosyn-thesis' are grouped together when the GTOM2 measure is used. However, they are separated into two distinct subgroups if GTOM0 or GTOM1 are used. This suggests that GTOM2 is a more sensitive measure for detecting the higher order connections between the nodes in this large module. Thus membership in the protein biosynthesis module is more robust when neighborhoods of step size 2 is used for measuring topological overlap.
For the sake of brevity, we only present our analysis of the protein biosynthesis module in this methodological paper. Since the protein biosynthesis pathway is relatively large, it makes sense to use a relatively sensitive dissimilarity measure (GTOM2) since it favors the discovery of large modules. However, when considering a functional category that involves few genes, it would be better to use a dissimilarity measure with higher specificity (GTOM0 or GTOM1) since it favors the discovery of smaller modules. A more detailed biological analysis of a related yeast coexpression network can be found in [3].

Comparing GTOMm to the correlation coefficient
Since the absolute value of the Pearson correlation coefficient is widely used for clustering gene expression profiles, we compare it here to the GTOM measures. Specifically, we consider the following class of correlation based dissimilarities: Here, ρ ij is the (Pearson) correlation coefficient between the expression profiles of i and j. Setting p = 1 yields the absolute correlation coefficient which is widely used for clustering genes. Setting p > 1 has the effect of emphasizing larger values of |ρ ij | while deemphasizing the smaller ones. We consider p = 6 since we find that the resulting distance is highly related to GTOM1 in the yeast dataset. Such a setting has also been used in [26] for functional annotation. Figure 3 shows the relationship between six dissimilarity measures, GTOM-based d T, [m] for m = 0,1,2,3 and correlation-based d C, [p] for p = 1,6. For the yeast co-expression network, we arrive at the following results. First, d C, [6] is highly correlated (> 0.8) with the lower-order GTOM dissimilarities, d T,[0] and d T, [1] . The correlation-based measure d C, [1] is highly correlated (0.79) with d T, [2] . Second, the higher-order GTOM dissimilarities d T, [2] and d T, [3] show a high correlation of 0.78. Third, two GTOM-based dissimilarities are moderately correlated (< 0.4) if their orders differ by 2 or more. Finally, the frequency distribution of d T, [3] is concentrated around 0 while that of the others are concentrated around 1. This illustrates that increasing m leads to increased sensitivity but decreased specificity when defining interconnectedness.

Hierarchical clustering and GTOM plots
In networks involving few nodes, modules can easily be identified by inspecting the network but for large networks involving hundreds of nodes, it is useful to provide a 'reduced' view of the network. For example, one can visualize the topological overlap dissimilarity using classical multi-dimensional scaling plots [27], see the Multidimensional Scaling Plots section. Alternatively, it can be useful to visualize the topological overlap dissimilarity matrix [ ] directly using a TOM plot. As an example, consider the four GTOM plots corresponding to the zeroth-to third-order GTOM in Figure 4. The dataset used here is the same as the one in Figure 2. Red/yellow indicate low/high values of . Both rows and columns of have been sorted using the hierarchical clustering tree. Since is symmetric, the GTOM plot is also symmetric around the diagonal. Since modules are sets of nodes with high (generalized) topological overlap, modules correspond to red squares along the diagonal. Figure 4 shows that modules are more pronounced and larger with increasing values of m. This illustrates that higher values of m increase the sensitivity of measuring interconnectedness at the expense of specificity. This is further discussed in the section on the asymptotic behavior of GTOM below. For comparison purposes, a color bar is shown on the top on each GTOM plot. The color bar is ordered by the respective dendrogram and colored by the GTOM1 module assignment (c.f. Figure 2B.) The fact that the module colors stay together for different choices of m provides evidence that the module assignment is fairly robust with respect to the dissimilarity measure. One advantage of our proposed general class of dissimilarity measures is that they allow one to verify that module assignment is robust with respect to different network dissimilarities. If there is a strong biological signal, one would hope that the results are robust with respect to different choices of statistical methods.
But a more subtle analysis provides indirect empirical evidence of the usefulness of GTOM2 for module definition. Note that a second color bar is included on the left of the heatmap. Here, dark red indicates the membership to the class 'protein biosynthesis'. Genes that belong to other classes (or are unknown) are depicted by a gray color in the bar. We observe that protein biosynthesis genes are grouped together in the GTOM2 and GTOM3 plots. In each column, the top row shows the dendrogram obtained by applying the average linkage hierarchical clustering to the corresponding GTOM dissimilarity, the middle row shows the color bar ordered by the corresponding dendrogram but colored by the module assignment with respect to the TOM measure in B, the bottom shows the color bar ordered by the corresponding dendrogram but colored in dark red if the gene belongs to the class 'protein biosynthesis'. The modules defined by the TOM are the branches of the dendrogram in B at the cutoff 0.95. Almost all protein biosynthesis genes are grouped together by the proposed new TOM measure whereas the other two measures tend to distribute the class over two modules. The modules defined by GTOM2 are more pronounced in the sense that they are separated by larger distances.

Multi-dimensional scaling plots
We visualize the dissimilarity measures using classical multi-dimensional scaling (MDS) plots. Classical multidi-mensional scaling takes as input matrix a dissimilarity matrix (here the GTOM dissimilarity). The result of multidimensional scaling are vectors in a low dimensional Topological overlap matrix plots for the yeast gene co-expression network Modules are more pronounced in the GTOM2 and GTOM3 plots (larger contrast between the diagonal blocks and off-diagonal blocks). Smaller modules (as diagonal blocks of red) are more visible in GTOM0 and GTOM1 plots whereas larger modules are more respected in GTOM2 and GTOM3 plots. However, GTOM3 leads to excessively large modules and thus the specificity of the modules is compromised. Protein biosynthesis genes are grouped together in the GTOM2 and GTOM3 plots.

Pair-wise scatter plots between different GTOMm dissimilarity measures
Euclidean space (here the 2 dimensional Euclidean plane) such that the Euclidean distances between the vectors approximate the dissimilarities. To compute these vectors, an eigenvector problem is solved to find the locations that minimize distortions to the dissimilarity matrix [27].
The MDS plots are shown in Figure 5. All the plots are color-coded according to the modules with respect to GTOM1 depicted in Figure 2. The relative position of the points are well-preserved as we can see that points having the same color are almost always clustered together. Genes that belong to the class 'protein biosynthesis' are depicted by the symbol '▲'. Other genes are denoted by a '❍'. Interestingly, almost all 'protein biosynthesis' genes are in the vicinity of each other. The plot using the GTOM1 dissimilarity in Figure 5B shows a more clear separation between the red and brown modules.
We observe from the MDS plots that there is a tendency of consolidation as the order m of the GTOM measure increases. This phenomenon can be seen in Figure 5D where GTOM3 is used and a few 'sinks' (points of attraction) have been formed.

Robustness of the results to network perturbations
To demonstrate that GTOM2 may outperform GTOM1 and GTOM0 in the case of noisy network data, we randomly removed connections in the yeast gene co-expression network. Specifically, we randomly set a proportion p = 0, 0.1, 0.25, 0.33, 0.5, 0.67, 0.75, 0.9 of entries in the adjacency matrix to 0. We used the perturbed network to compute the corresponding GTOMm measures (for m = 0,1,2,3). To quantify the ability of GTOMm to separate protein biosynthesis genes ( ) from non-protein biosynthesis genes ( ) we defined a measure of separation, which is motivated by the intergroup dissimilarity measures used in average linkage hierarchical clustering. Specifically, we define where is the average GTOMm among protein biosynthesis genes and is the average GTOMm between protein biosynthesis and non-protein biosynthesis genes. Here, | | and | | denote the total number of protein biosynthesis genes and non-protein biosynthesis genes respectively. The higher the value of GTOMdiff(m), the better is the performance of GTOMm in this application. For each probability p, we averaged the results across 20 perturbed versions of the network. The results in Figure 6 demonstrate that high values of m counter the effect of misspecified (missing) adjacencies in this application.

A simple example
An example comparing modules detected by the GTOM1 and GTOM2 similarities is given in Figure 7. As a rule of thumb, if many of the nodes in a module are separated by a distance of 1 or 2 from each other, then they form a tight module with respect to the GTOM1 similarity. Likewise, if many of the nodes in a module are separated by a distance of 3 or 4 from each other, then they form a tight module with respect to the GTOM2 similarity.

The asymptotic behavior of GTOMm for large m
Here we consider the situation when m is larger than or equal to the network diameter, i.e. each pair of nodes can be connected with a path of length ≤ m. In this case, |N m (i)| = n -1 and |N m (i) ∩ N m (j)| = n -2 where n denotes the network size. Then when n is large. This demonstrates that for sufficiently large values of m all pairs of nodes within a connected network component will be highly interconnected. Thus, large and tight modules result when GTOMm with large m is used as input of a clustering procedure, see Figure 4.

Choosing the order m
How to choose the order m is an important question in many applications. While it seems intuitive that the choice of m has some relationship to the network diameter, it is unclear to us how to use the network diameter to guide the choice of m (other than providing an upper bound).
In general, the optimal choice of m will depend on the data quality and the goals of the analysis. Roughly speaking, if the adjacency matrix contains very few errors and if the goal is to determine which nodes are linked to a given node then m = 0 is the obvious choice. But if many adjacencies have falsely been set to 0 (since the corresponding connections are unknown) and/or if the goal is to detect possibly longer ranging interactions then relatively large When an external label is available for at least some of the nodes then one can use it to inform the choice of m. For example, when the external node label y encodes group membership, then one can choose m so that the groups have high within group interconnectedness and low between group interconnectedness. To make this more specific, we assume that there is evidence that the nodes of group 1 are highly interconnected and that they are well separated from nodes of group 0. For example, in our Multi-dimensional scaling plots of the yeast gene co-expression network Figure 5 Multi-dimensional scaling plots of the yeast gene co-expression network. MDS plots using A. GTOM0, B. GTOM1, C. GTOM2, and D. GTOM3. The coloring scheme is used to reflect the 7 modules shown in Figure 2B detected by using hierarchical clustering with the GTOM1-based dissimilarity. The symbol '▲' denotes genes that belong to the functional category 'protein biosynthesis'. Genes that belong to other classes are denoted by a '❍'. In general, the module assignment is preserved across the different GTOM measures. But the spatial distributions of the points vary to a large extent. Genes in the 'protein biosynthesis' class appear to be closer together.
yeast gene co-expression network, we have shown above that the average GTOMm measure between protein biosynthesis genes (group 1) is larger than the average GTOMm measure between protein biosynthesis genes and non-protein biosynthesis genes ( Figure 6).  A simple example where GTOM2 is superior to GTOM1 Figure 7 A simple example where GTOM2 is superior to GTOM1. GTOM neighborhood of size S = 7 around node 1. A. GTOM1 neighbors are colored in black. B. GTOM2 neighbors are colored in black. Note that GTOM2 detects the 'true' neighborhood (comprised on nodes 1 through 8) while GTOM1 misses nodes 6 and 7.

Discussion
Several measures that keep track of shared 1 step neighbors have been proposed in the literature, e.g. [28]. Here, we propose a natural generalization of the widely used topological overlap matrix. This class of new measures is constructed by keeping track of the number of m-step neighbors that a pair of nodes share. The GTOM similarity measure is normalized to take on values in the unit interval. A corresponding dissimilarity measure can be defined by subtracting the GTOM similarity from 1.
While we find it a worthwhile goal for future research to develop statistical or heuristic criteria for choosing m, we find that a main advantage of GTOMm is that it allows one to assess the robustness of network analysis results. In many applications (e.g. module definition, neighborhood analysis), it will be worthwhile to show that the results are relatively robust with respect to m since this indicates that the biological signal is strong. While we present applications where non-standard choices of m lead to superior results, we have found in several other (unreported) applications that the results are robust with respect to m = 0,1, 2. By randomly deleting network adjacencies in the yeast gene co-expression network application, we have shown that large values of m can counter the effect of misspecified (missing) adjacencies. GTOMm becomes uninformative if m is larger than the network diameter. Thus, GTOMm will be useful in networks with moderate or large 'degree of separation' (average path length between any pair of nodes). Since biological networks tend to have low diameters [29], we expect that low values of m will be preferable in most applications. But we have provided two real data applications where m = 2 is preferable over m = 1. In general, the GTOM measures with lower orders m will be useful for discovering small modules while those with higher orders favor the discovery of larger modules.
A limitation of our approach is that it is only defined for unweighted networks, i.e. the entries of the adjacency matrix should be 0 or 1. Another limitation is that we only consider the topological overlap between 2 nodes. A multi-node extension of the GTOM1 measure is presented in [30].

Conclusion
The generalized topological overlap measure can serve as a filter for countering the effect of spurious or missing connections. The order m of the topological overlap measure can serve as a tuning parameter for interconnectedness that trades off sensitivity versus specificity. Since different orders of m probe different neighborhoods, adjusting m allows the user to consider network modules at different 'zoom' levels. We provide additional Materials and Methods as well as the statistical software code, a tutorial along with customized R functions, and the accompanying data files at the web page [31]. Thus, the reader should be able to reproduce all of our findings.

Topological overlap matrix
The topological overlap of two nodes reflects their similarity in terms of the commonality of the nodes they connect to. Note that in an unweighted network, the number of shared neighbors of nodes i and j is given by ∑ u≠i,j a iu a uj . Ravasz et al. [1] define the topological overlap measure t ij as follows where l ij = ∑ u≠i,j a iu a uj , k i = ∑ u≠i a iu . An advantage of the quantity 1 -a ij in the denominator is that it prevents the denominator from becoming 0 when the connectivities (degrees) k i and k j are 0. Since a ij ≤ 1, one can easily show that l ij = ∑ u≠i,j a iu a uj ≤ ∑ u≠i a iu -a ij = k i -a ij . It follows that l ij ≤ min(k i -a ij , k j -a ij ) and that the numerator of t ij is smaller than the denominator, i.e. 0 ≤ t ij ≤ 1.
We remark that the definition of TOM given in [1] is slightly different from Eq. (7): (l ij + a ij )/min{k i , k j }. In a personal communication with E. Ravasz, the definition in Eq. (7) is preferred, which is also given in the online supporting material of [1]. The inclusion of the term a ij in the numerator makes t ij explicitly depend on the existence of a direct link between the two nodes in question.

An algorithm for computing GTOM
In this subsection, we present computational formulas for T [m] . In this subsection, we assume that the diagonal of A has been set to 0. Then the ij-th entry of the matrix power A m counts the number of paths of length m connecting nodes i and j [32]. But the paths are not necessarily geodesic and may contain cycles.