Graph definitions
Formally, a network (or graph) is a collection of nodes (or vertices) together with a collection of edges joining pairs of nodes. Typically, graphs are denoted G = (V, E), where V = V(G) is the set of nodes and E = E(G) is the set of edges. We consider directed graphs, in which the endpoints of an edge are ordered, that is, an edge (i, j) ∈ E(G) is distinct from (j, i) ∈ E(G). If both edges (i, j) and (j, i) occur in G, we describe them together as a bidirectional edge. An (i, i) edge, from a node to itself, is known as a loop or a self-loop. In network motif studies, input networks are usually considered after all self-loops have been discarded.
A subgraph of G = (V, E) is a graph H = (V′, E′) for which V′ ⊆ V and E′ ⊆ E, and the edges in E′ have both endpoints in V′. A subgraph is called induced if E′ = {(i, j) ∈ E : i ∈ V′ and j ∈ V′}, that is, it includes every edge of the original graph whose endpoints both lie in V′. A graph is called connected if there is a path between any two vertices (ignoring edge directions).
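As a small illustration (a minimal sketch in plain Python; the edge list and node subset are hypothetical), the induced subgraph on a node subset V′ keeps every edge of G whose endpoints both lie in V′:

```python
# Directed edges are ordered pairs (i, j); (1, 2) together with (2, 1) forms a
# bidirectional edge.  The induced subgraph on v_prime keeps every edge of G
# whose endpoints both lie in v_prime.
def induced_subgraph(edges, v_prime):
    v_prime = set(v_prime)
    return [(i, j) for (i, j) in edges if i in v_prime and j in v_prime]

E = [(1, 2), (2, 1), (2, 3), (3, 4)]
print(induced_subgraph(E, {1, 2, 3}))  # [(1, 2), (2, 1), (2, 3)]
```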
Note that in our analysis we consider induced subgraphs exclusively: although some methods (such as MODA and NeMo, noted below) are capable of sampling non-induced subgraphs, most widely used methods [28, 29], including Mfinder, FanMod [9], and NetMODE [30], sample only induced subgraphs.
Datasets
The following networks are included in this work.
Transcription factor (TF) networks:
◦ E. coli (sourced from RegulonDB [31]); 1620 non-isolated nodes, 3885 directed edges.
◦ S. cerevisiae (yeast) TF network (sourced from Uri Alon’s website http://www.weizmann.ac.il/mcb/UriAlon/); 688 non-isolated nodes, 1079 directed edges.
The E. coli metabolic network (we use the version packaged with Kavosh [32], where it is attributed to the Kyoto Encyclopedia of Genes and Genomes (KEGG) [33]); 672 non-isolated nodes, 1273 directed edges.
Network motif detection software
Many packages have been developed that enable some kind of motif detection. Mainstream packages can perform (what is sometimes called) a k-node subgraph census, that is, they can perform Steps 1–3 of the algorithm described in the Background section simultaneously for all k-node subgraphs. These packages include Mfinder [8], MAVisto [5, 11], FanMod [10, 34, 35], Kavosh [32], G-Tries [36], and NetMODE [30].
FanMod and Mfinder are packages that are widely used for biological research today, although it is also commonplace for researchers to write their own purpose-built software; see, e.g., [37, 38]. Usually input networks are assumed to be loop-free directed graphs, and a search through all (or a sample of) induced k-node subgraphs is performed in both the input network and the ensemble of comparison networks.
Variations of the subgraph census theme are common. For example, NeMoFinder [37] and MODA [39] only treat undirected graphs; MODA and NeMo [40] can perform a k-node subgraph census for non-induced subgraphs; RAGE [41, 42] focuses on 4-node subgraphs.
All experiments in this paper were performed using NetMODE, whose method for allotting Ω is as described in the Background section. In all cases, we used |Ω| = 10,000. We use NetMODE because it has a “verbose mode” that allows the intermediate results (along with the final Z-scores, etc.) to be examined.
Generation methods for random similar graphs
The configuration model
The configuration model (sometimes called the “stubs method”) is an algorithm for sampling uniformly at random from \( \mathbbm{S}(G) \). It works by sampling uniformly at random from a superset \( T\supseteq \mathbbm{S}(G) \), then sampling again if the draw does not belong to \( \mathbbm{S}(G) \) (see, e.g., [70, Ch. 13]). The method begins with a graph G with m edges and n nodes, whose degree sequence is D = {(ci, di)}, where ci and di are respectively the out-degree and in-degree of node i ∈ {1, 2, …, n}. The aim is to generate a new graph G′ sampled uniformly from \( \mathbbm{S}(G) \). The algorithm achieves this as follows:
Step 1: Allot a random permutation β of {1, 2, …, m}.
Step 2: Create a bipartite graph B with m + m nodes given by the bipartition {i1, i2, …, im} ∪ {j1, j2, …, jm}, and edges determined by β such that for each v ∈ {1, 2, …, m}, an edge is added from iv to jβ(v).
Step 3: From B, generate a candidate target graph G′ by identifying the following vertices based on the input degree sequence D:
\( {i}_1,{i}_2,\dots, {i}_{c_1} \) and \( {j}_1,{j}_2,\dots, {j}_{d_1} \),
\( {i}_{c_1+1},{i}_{c_1+2},\dots, {i}_{c_1+{c}_2} \) and \( {j}_{d_1+1},{j}_{d_1+2},\dots, {j}_{d_1+{d}_2} \),
and so on.
Step 4: Repeat Steps 1–3 until \( {G}^{\prime}\in \mathbbm{S}(G) \).
Step 5: Return the target graph G′.
Steps 2 and 3 are illustrated with an example in Fig. S9. The configuration model ensures that networks are sampled uniformly and independently, but in most real-world instances it is impractically slow.
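The following is a minimal sketch of this rejection loop in Python (the function name and the degree sequence shown are hypothetical): out-stubs are paired with in-stubs via a random permutation, and the candidate is redrawn whenever it contains a self-loop or a repeated edge.

```python
import random

def configuration_model_sample(out_deg, in_deg, max_tries=100_000):
    """Rejection-sample a simple directed graph with the given degree sequence.

    Node i contributes out_deg[i] out-stubs and in_deg[i] in-stubs (the bipartite
    construction of Steps 2-3); a random permutation pairs them (Step 1), and the
    candidate is redrawn if it contains a self-loop or a repeated edge (Step 4)."""
    out_stubs = [i for i, c in enumerate(out_deg) for _ in range(c)]
    in_stubs = [j for j, d in enumerate(in_deg) for _ in range(d)]
    assert len(out_stubs) == len(in_stubs), "out- and in-degrees must balance"
    for _ in range(max_tries):
        random.shuffle(in_stubs)                 # plays the role of the permutation beta
        edges = list(zip(out_stubs, in_stubs))
        if all(i != j for i, j in edges) and len(set(edges)) == len(edges):
            return edges                         # G' is simple, i.e. in S(G): Step 5
    raise RuntimeError("no simple graph found within max_tries")

# Hypothetical degree sequence: node i has out-degree out_deg[i] and in-degree in_deg[i].
print(configuration_model_sample(out_deg=[2, 1, 1, 0], in_deg=[0, 1, 1, 2]))
```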
The switching method
The switching method generates an ensemble Ω from a graph G. At its core is the function described below, which generates a graph G′ from a graph G such that G′ is in \( \mathbbm{S}(G) \). We use the notation \( x{\in}_{\mathrm{uar}}X \) to indicate that x is sampled uniformly at random from the set X.
A single iteration of the main loop in GraphSwitch performs the switch illustrated in Fig. 8. This is repeated once for each vertex in G, yielding G′. To create an ensemble, we take the network G0 to be G, the network G1 to be the output of GraphSwitch on G0, the network G2 to be the output of GraphSwitch on G1, and so on. The ensemble Ω is chosen to be {G30003, G30006, G30009, …}, with the index of the last element chosen according to the desired size of Ω. This algorithm, with the constants described here, is implemented in both Kavosh and NetMODE. The algorithm used in FanMod is similar, but instead of attempting a switch per vertex, it attempts a switch per edge.
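The pseudocode of GraphSwitch itself is not reproduced here; the sketch below (Python, with hypothetical function and parameter names) shows the standard degree-preserving switch of Fig. 8 applied a chosen number of times, which is one reading of the procedure above:

```python
import random

def graph_switch(edges, n_attempts):
    """Attempt n_attempts degree-preserving switches on a simple directed graph.

    Each attempt picks two edges (a, b) and (c, d) uniformly at random and rewires
    them to (a, d) and (c, b) (the switch of Fig. 8), skipping the attempt if it
    would create a self-loop or a duplicate edge.  In the scheme described above,
    n_attempts would be the number of vertices (Kavosh/NetMODE) or the number of
    edges (FanMod)."""
    edges = list(edges)
    edge_set = set(edges)
    m = len(edges)
    for _ in range(n_attempts):
        p, q = random.randrange(m), random.randrange(m)
        (a, b), (c, d) = edges[p], edges[q]
        if a == d or c == b:                              # would create a self-loop
            continue
        if (a, d) in edge_set or (c, b) in edge_set:      # would create a duplicate edge
            continue
        edge_set.difference_update([(a, b), (c, d)])
        edge_set.update([(a, d), (c, b)])
        edges[p], edges[q] = (a, d), (c, b)
    return edges
```

Under this sketch, the ensemble construction above corresponds to applying the function repeatedly (G1 from G0, G2 from G1, and so on) and retaining the networks at the stated indices.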
The switching method is practical for real-world instances, but, unlike the configuration model, it does not result in a uniform sample from \( \mathbbm{S}(G) \), does not guarantee independence between elements of Ω, and, in fact, does not always give a positive sampling probability to every element of \( \mathbbm{S}(G) \) (see [21] for further discussion). Figure 3 demonstrates that these drawbacks are not merely theoretical, but have wide-ranging implications for the practical use of the method.
As discussed in the Results section, the outputs of the switching method are frequently not reproducible across trials. To illustrate why this is the case, consider the graph \( \mathcal{G}=\left(\mathbbm{S}(G),\left\{(x,y):x\in \mathbbm{S}(G)\text{ and }y\text{ is a possible output of GraphSwitch}(x)\right\}\right) \). This graph defines the topology over which GraphSwitch performs a random walk. It is clear that sequentially allotted elements of Ω are highly correlated, because they are close to each other in the graph distance on \( \mathcal{G} \). A more subtle point is that the topology of \( \mathcal{G} \) may be replete with small clusters that have little interconnection with other clusters. When this is the case, GraphSwitch will tend to sample all of Ω from only a handful of clusters. The high correlation between elements of Ω is then not restricted to sequential elements, but also manifests itself in elements separated by many switches. This tendency explains the significant divergence between the histograms in Fig. 3.
Even with |Ω| = 10,000, histograms of nR(H) are demonstrably non-repeatable, with the estimates of μ and σ varying widely between runs. In the figure, the p-value for nG(H) is estimated once at 0.011 and once at 10⁻²⁹. The reason is that although the expectation of the p-value estimate is the same whether or not the samples are independent, the correlation between draws makes the effective size of Ω much smaller than 10,000, so the variance of the p-value estimate is unusably large.
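To make the effect of correlated draws concrete (a self-contained numerical sketch, not derived from the motif data; the AR(1) process and its parameters are stand-ins for the correlated sequence of subgraph counts), compare the spread of a mean estimated from independent versus strongly correlated samples:

```python
import random
import statistics

def mean_of_ar1(n, rho):
    """Mean of n draws from a stationary AR(1) process with lag-one correlation rho
    (mean 0, unit marginal variance); rho = 0 gives independent draws."""
    x, total = random.gauss(0, 1), 0.0
    for _ in range(n):
        x = rho * x + (1 - rho ** 2) ** 0.5 * random.gauss(0, 1)
        total += x
    return total / n

for rho in (0.0, 0.99):
    means = [mean_of_ar1(10_000, rho) for _ in range(200)]
    print(f"rho = {rho}: sd of estimated mean ~ {statistics.stdev(means):.3f}")
# With rho = 0.99 the effective sample size is roughly n(1 - rho)/(1 + rho), about 50,
# so estimates of mu (and hence of the p-value) vary far more from run to run,
# even though 10,000 samples were drawn in both cases.
```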