In this section we describe our method. Section “Algorithm overview” presents an overview of our algorithm. Section “Joining patterns to find larger patterns” explains the mechanism we use to grow motifs by joining smaller motifs. Section “Finding MIS: Going from F1 to F2” describes how we count disjoint motif instances. Section “Accelerating our algorithm through efficient filters” presents filtering techniques we implement to avoid costly isomorphism tests. Section “Complexity analysis” discusses the complexity analysis of our method.
Algorithm overview
In this section, we provide an overview of our method for discovering motifs. At the heart of our method lie four unique graph patterns. We call them the basic building patterns, for we use them as a guide to construct larger motifs of arbitrary sizes and topologies. Figure 2 presents these basic building patterns. We explain in detail why we use these four specific patterns in Section “Joining patterns to find larger patterns”.
Algorithm 1 presents the pseudo-code of our method. We elaborate on each key step in the subsequent sections. The algorithm takes as input a graph G, the number of nodes of the target motif μ, and the minimum acceptable motif frequency α. For each of the four basic building patterns, it first locates all subgraphs in G that are isomorphic to that pattern (Line 1). Let us denote the set of instances of the ith pattern (i ∈ {1, 2, 3, 4}) with \(S_i\). Each set \(S_i\) may contain overlapping subgraphs. The algorithm then extracts the maximum set of edge-disjoint subgraphs in each set \(S_i\) (Line 2) (see Section “Finding MIS: Going from F1 to F2” for details). Let us denote the resulting set for the ith pattern with \(S_i'\). Notice that the cardinalities of the sets \(S_i\) and \(S_i'\) are the F1 and F2 measures of the ith pattern, respectively. At this point, the union of all the sets \(S_i'\) constitutes both the current motif instances and the basic building pattern instances (Line 3). The algorithm then iteratively grows the current motif set. At each iteration, it joins the current motif set with the basic building pattern set (Line 9). More specifically, a motif instance and a basic building pattern instance join if they share at least one edge. Joining two such subgraphs either creates a pattern which already exists in the current set (Line 10) or a new pattern (Line 12). At each iteration, after growing the current set, the algorithm filters the overlapping subgraphs to identify the MIS of each pattern (Line 18). It then removes all patterns with frequency lower than the user-supplied cutoff (Line 21) and reports the frequent subgraphs that have as many edges as the target motif size (Line 23). The algorithm terminates when the current set cannot be grown to yield any further patterns that can satisfy the target motif (i.e., each pattern in the current set is either larger than the target motif size or its frequency is lower than the user-specified threshold).
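To make the control flow of Algorithm 1 easier to follow, the sketch below mirrors the steps listed above. It is only a skeleton under our own naming: the callables passed in (instance search, MIS extraction, the join step, and a pattern-size function) stand for the routines detailed in the following sections and are not part of Algorithm 1 itself.

```python
def discover_motifs(G, mu, alpha, basic_patterns,
                    find_instances, extract_mis, join_step, pattern_size):
    """Skeleton of the overall method; the heavy lifting is delegated to the
    injected callables, so only the control flow of Algorithm 1 is fixed here.

    G: input graph, mu: target motif size, alpha: minimum F2 frequency (count),
    basic_patterns: the four basic building patterns of Fig. 2.
    """
    # Lines 1-3: instances of each basic pattern, reduced to edge-disjoint sets.
    basic = {p: extract_mis(find_instances(G, p)) for p in basic_patterns}
    current = dict(basic)

    results = []
    while current:
        # Lines 5-17: grow every current pattern by joining its embeddings
        # with basic-pattern embeddings that share at least one edge.
        grown = join_step(current, basic)
        # Line 18: keep only edge-disjoint embeddings (MIS) per pattern.
        grown = {p: extract_mis(embs) for p, embs in grown.items()}
        # Lines 19-25: prune infrequent patterns; report target-size patterns.
        current = {}
        for p, embs in grown.items():
            if len(embs) < alpha:            # F2 below the cutoff
                continue
            if pattern_size(p) == mu:        # reached the target motif size
                results.append((p, embs))
            elif pattern_size(p) < mu:       # grow further in the next iteration
                current[p] = embs
            # patterns larger than the target size are dropped
    return results
```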
Joining patterns to find larger patterns
Here, we describe one join iteration of our method: the process of joining the subgraphs of the current set of patterns with the subgraphs of the basic building patterns to construct larger patterns. At the end of the iteration, the resulting set of subgraphs becomes the current set of subgraphs for the next join iteration.
Recall that we join two subgraphs only if they share at least one edge. Joining two such subgraphs yields a pattern that is either isomorphic to one of the existing patterns or new. In the former case, we consider the set of subgraphs S isomorphic to that pattern and check if the new subgraph is already in S. If it is, we discard it; otherwise, we store it in S. In the latter case (i.e., the pattern is observed for the first time), we save this as a new pattern and also keep the corresponding subgraph.
Notice that, although the subgraphs in S do not overlap prior to the join, this may no longer hold after new subgraphs are inserted into S. At the end of each join iteration, we therefore select the MIS for each pattern. We defer the discussion of how we do this to Section “Finding MIS: Going from F1 to F2”. We then remove the patterns with F2 values below the user-supplied frequency threshold, α. This eliminates non-promising patterns, and thus reduces the number of candidate patterns for the next join iteration. Using the F2 measure ensures that patterns maintain the downward closure property. Thus, non-frequent patterns will never grow to yield frequent patterns.
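The following minimal sketch illustrates one such join on embeddings stored as sets of undirected edges. The helper names and the use of networkx for the exact isomorphism test are our own choices for illustration; in the actual method, the filters of Section “Accelerating our algorithm through efficient filters” run before any exact test.

```python
import networkx as nx

def try_join(emb_a, emb_b):
    """Join two embeddings (sets of undirected edges) if they share at least
    one edge; return the merged edge set, or None if they do not join."""
    if not (emb_a & emb_b):
        return None
    return emb_a | emb_b

def classify(new_emb, known_patterns):
    """Return the index of the known pattern the new embedding matches,
    or None if its topology is seen for the first time."""
    H = nx.Graph(list(new_emb))
    for idx, P in enumerate(known_patterns):
        if nx.is_isomorphic(P, H):
            return idx
    return None

# Join an M1 embedding with a 3-star embedding that shares the edge (b, c).
edge = lambda u, v: frozenset({u, v})
m1_emb = {edge("a", "b"), edge("b", "c")}
star_emb = {edge("b", "c"), edge("b", "d"), edge("b", "e")}
merged = try_join(m1_emb, star_emb)
print(sorted(tuple(sorted(e)) for e in merged))
# [('a', 'b'), ('b', 'c'), ('b', 'd'), ('b', 'e')]  -> a 4-edge pattern
print(classify(merged, [nx.star_graph(4)]))   # 0 -> matches an existing pattern
```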
Why do we need different equivalence classes? If the motif frequency is measured using F1, it is sufficient to join the subgraphs belonging to existing patterns with only those belonging to the equivalence class of the simple pattern with two edges (see Fig. 2a) to construct any larger pattern. This, however, is not true when F2 (or F3) is used to count the motif frequency. To understand the rationale behind this, recall that each equivalence class represents a set of disjoint isomorphic subgraphs. As a result, no two subgraphs from the same equivalence class join, for they do not share any edges. Therefore, we need more than one equivalence class to construct new and larger patterns.
Given that we need multiple patterns, we next seek the answer to the following question: What is the smallest set of patterns which can be used to produce arbitrarily large topologies by joining them? Here we outline the key steps of the proof that the four basic building patterns, presented in Fig. 2, suffice to construct any larger pattern. That said, we do not guarantee to find all copies of such patterns in the target network.
Before we discuss our induction steps, we explain our strategy on the specific motif size of four to improve the clarity of the discussion on induction. Figure 3 shows all the possible patterns which can be constructed with four undirected edges. A careful inspection shows that each one is an overlapping combination of two of the basic building patterns. For instance, the pattern in Fig. 3a can result from joining the basic pattern in Fig. 2a with the basic pattern in Fig. 2c. It is worth noting that we can construct some of the patterns in Fig. 3 by joining two different pairs of basic building patterns. This redundancy ensures that we can still locate a specific pattern even if one of those pairs does not exist. Therefore, our method can construct any pattern with four edges from patterns with three or two edges.
We conduct our proof for the arbitrary pattern size by induction.
Basis The four basic patterns in Fig. 2 constitute all possible graph topologies with two or three edges.
Induction step We assume that our method can construct any pattern with up to k edges (k≥ 3). We next show that any pattern with k+1 edges can be constructed by joining a pattern with k edges with one of the basic building patterns.
Recall that the downward closure property states that a smaller pattern has at least as high a frequency, according to F2, as any larger pattern that contains it (see Theorem 1.1). This means that if a pattern with k+1 edges is frequent, then so is any of the k-edge patterns obtained by removing an edge from that pattern.
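In our notation, writing F2(·) for the maximum number of edge-disjoint embeddings and P′ for any pattern obtained from P by deleting edges, this property can be restated as follows (the formal statement is Theorem 1.1):

$$ P^{\prime} \subseteq P \implies F_{2}\left(P^{\prime}\right) \geq F_{2}(P) $$

so deleting an edge from a frequent pattern never yields an infrequent one.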
Consider a graph G and a copy S1 of a pattern P1 with k edges in G. Also, consider a copy of a pattern P2 with k+1 edges such that P2 contains P1 plus one additional edge. Let us denote this additional edge with (a, b). We need to show that P2 can be obtained from P1 by joining it with at least one of the basic patterns.
Since both P1 and P2 are connected graphs, at least one of the two nodes a and b has an edge in P1. Without loss of generality, let us assume that b has an edge (b, c) in P1. Figure 4a illustrates the two edges (a, b) and (b, c).
First, we consider using the basic pattern M1 in Fig. 2a in the join operation. In this case, a copy of M1, {(a,b),(b,c)}, joins with S1 through the common edge (b,c), which results in the pattern P2 with k+1 edges. This join, however, occurs only if the subgraph {(a,b),(b,c)} is included in the F2 counts of M1 (i.e., within the chosen non-overlapping copies of M1).
If this condition fails, we consider the degrees of the two nodes b and c in pattern P1. We start with node c. Let us denote the degree of a node with the function deg() (e.g., deg(c) is the degree of node c in pattern P1).
If deg(c)>1, then c has at least one more edge in addition to (b,c). Let us denote this edge with (c, d) (see Fig. 4b). In this scenario, we join S1 with a copy of the motif M4 (Fig. 2d), {(a,b),(b,c),(c,d)} (if this copy exists in the F2 count of M4), to obtain P2.
Finally, if deg(c)=1, it is guaranteed that deg(b)>1. This is because if both nodes b and c had degree one, S1 could not be a connected subgraph. Let us denote one of the additional edges of b with (b,d) (see Fig. 4c). In this case, we join the subgraph that is isomorphic to the pattern M3, {(a,b),(b,c),(b,d)}, with S1 to obtain P2. We can do this if this copy exists in the F2 count of M3.
In summary, we conclude that any pattern P2 with k+1 edges can be constructed by joining a pattern P1 with k edges (or k−1 edges) and one of the basic building patterns to obtain the additional edge (or edges), provided that at least one of the scenarios above holds. We cannot, however, guarantee that the joins will find all of the instances of the (k+1)-edge pattern in the target graph.
Recall that, as we aim to calculate the frequency of a given motif using F2, no pattern joins with itself. Thus, the basic building pattern set is the smallest such set of patterns, as we cannot construct any one of those four patterns using the other three. More specifically, this means that we cannot use only one of those four basic building patterns to construct larger patterns by joining pairs of subgraphs belonging to that pattern’s equivalence class. This is because if we join the embeddings of a single motif topology (such as the first pattern in Fig. 2a), we cannot get any larger pattern, as they do not share any edge(s).
Finding MIS: Going from F1 to F2
Here, we explain how we compute the F2 frequency for a given pattern. We use two algorithms for this purpose; we explain why we need two separate algorithms later in this section, after describing them. The first one is a heuristic used in the literature [16]. This algorithm constructs a new graph, called the overlap graph, for each pattern as follows. Each node in the overlap graph of a pattern denotes an embedding of that pattern in the target graph. We add an edge between two nodes of the overlap graph if the corresponding embeddings overlap in the original graph. Once the overlap graph is constructed, the algorithm selects the node with the minimum degree (i.e., the embedding which overlaps with the fewest other embeddings) in the overlap graph. We include the subgraph represented by this node in the edge-disjoint set. We then delete that node along with all of its neighboring nodes in the overlap graph and update the degrees of the neighbors of the deleted nodes. We repeat this process of picking the smallest-degree node and shrinking the overlap graph until the overlap graph is empty.
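The sketch below illustrates this heuristic on embeddings stored as sets of edges. The representation and the function name are ours; it is meant to convey the greedy minimum-degree selection, not the implementation of [16].

```python
from itertools import combinations

def greedy_mis(embeddings):
    """Greedy MIS heuristic: embeddings is a list of edge sets.

    Builds the overlap graph (two embeddings are adjacent if they share an
    edge), then repeatedly keeps a minimum-degree node and removes it together
    with its neighbors. Returns the chosen edge-disjoint embeddings.
    """
    n = len(embeddings)
    adj = {i: set() for i in range(n)}          # adjacency of the overlap graph
    for i, j in combinations(range(n), 2):
        if embeddings[i] & embeddings[j]:       # share at least one edge
            adj[i].add(j)
            adj[j].add(i)

    alive = set(range(n))
    chosen = []
    while alive:
        # Pick the embedding that overlaps the fewest remaining ones.
        v = min(alive, key=lambda u: len(adj[u] & alive))
        chosen.append(embeddings[v])
        # Remove v and every embedding that overlaps it.
        alive -= adj[v] | {v}
    return chosen

# Example: three embeddings of the 2-edge path M1 in a small graph.
e1 = frozenset({("a", "c"), ("b", "c")})
e2 = frozenset({("b", "c"), ("c", "d")})
e3 = frozenset({("d", "e"), ("e", "f")})
print(len(greedy_mis([e1, e2, e3])))   # 2 edge-disjoint embeddings
```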
The algorithm described above works well for patterns with a small number of embeddings. It becomes computationally impractical, however, as the number of embeddings of the underlying pattern grows. This is because both constructing the overlap graph (particularly identifying its edges) and updating it are computationally expensive tasks. Therefore, we use this algorithm for all patterns except the basic building patterns (whose numbers of embeddings are often too large).
The second algorithm addresses the scalability issue of the first one. This scalability issue stems from the expensive task of calculating the degree of each node in the overlap graph (i.e., the number of overlaps of each embedding). Recall from the previous algorithm that this number serves as a loss value when selecting the node (i.e., embedding) with the minimum degree (i.e., number of overlaps) to include in the final MIS of the pattern under consideration. Briefly, the second algorithm avoids the expensive task of counting the overlaps of each embedding explicitly; it computes these numbers algebraically instead of performing actual overlap tests. Once we have computed the node degrees of the overlap graph, this algorithm selects the disjoint embeddings in the same way as the former algorithm. More specifically, it selects the node with the minimum degree and includes its corresponding embedding in the final MIS. It then removes the nodes neighboring that node from the overlap graph. It repeats this process until the overlap graph is empty. Next, we explain how we compute the degree of a node in the overlap graph for the pattern M1 in Fig. 2a. Our computation is similar for the other three basic building patterns, yet tailored to their specific topologies (the derivation is shown in Additional file 1: Appendix). Figure 5 shows a hypothetical subgraph S1 = {(a,c), (b,c)} in the input graph G which is isomorphic to M1. This subgraph is represented by a node in the overlap graph of M1’s embeddings. Let us denote the degree of a node in the original graph G with the function d() (e.g., \(d(v_i)\) is the degree of node \(v_i\)). Another embedding of M1 in G overlaps with S1 only if it contains the edge (a,c) or (b,c). Any edge in G connected to the middle node c forms two overlapping embeddings: one with the subgraph that has the edge (a,c) and the other with the subgraph that has the edge (b,c). We exclude the edges belonging to S1 itself (i.e., the embedding whose number of overlaps we want to calculate) from the edges of G considered for overlapping embeddings with S1. Thus, by excluding the two edges (a,c) and (b,c) from c’s degree, node c yields 2 × (d(c) − 2) overlaps. In addition, any edge incident to node a forms an embedding when combined with the edge (a,c). Excluding the edge (a,c), node a yields d(a) − 1 overlaps. Similarly, node b produces d(b) − 1 overlaps. Thus, the total number of overlaps for the embedding S1 = {(a,c), (b,c)}, combined from the edges of its three nodes {a, b, c}, is
$$ 2(d(c) - 2) + d(a) - 1 + d(b) - 1 = 2d(c) + d(a) + d(b) - 6 $$
Notice that, unlike the first algorithm, the second one requires a unique derivation for each pattern. Thus, we apply it only to the basic building patterns, for their topologies do not depend on the input graph. It is also worth noting that the basic building patterns typically have a much larger number of embeddings than the patterns derived by joining them. Thus, the efficiency of the second algorithm is needed more for them than for the patterns obtained in subsequent iterations (see experimental results).
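As a concrete illustration of the algebraic computation for M1, the snippet below evaluates the formula above directly from node degrees, assuming an embedding is stored as a triple (a, b, c) with c as the middle node; the function name is ours.

```python
import networkx as nx

def m1_overlap_count(G, a, b, c):
    """Number of other M1 embeddings sharing an edge with {(a,c), (b,c)},
    computed as 2(d(c) - 2) + (d(a) - 1) + (d(b) - 1) without any overlap test."""
    d = G.degree
    return 2 * (d[c] - 2) + (d[a] - 1) + (d[b] - 1)

# Tiny example: the path a-c-b plus one extra edge (c, x).
G = nx.Graph([("a", "c"), ("b", "c"), ("c", "x")])
# {(a,c),(b,c)} overlaps with {(a,c),(c,x)} and {(b,c),(c,x)}.
print(m1_overlap_count(G, "a", "b", "c"))   # 2
```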
To adapt our method so that it counts the non-overlapping embeddings of each pattern according to F3 instead of F2, we only need to change how we calculate the MIS of each pattern. More specifically, we change the criterion “two subgraphs overlap if they share at least one edge” to “two subgraphs overlap if they share at least one node” (see Section “Definitions and notation”). This changes the overlap graph constructed by the first method explained in this section. It also slightly changes the calculation of the total number of overlaps of each embedding in the second method. In practice, we expect the overlap graph to be denser under the F3 measure than under the F2 measure. To illustrate this, consider the graph G in Fig. 1a and the pattern in Fig. 1c. This pattern has three embeddings in G, namely S1, S2, and S3, defined by the edge sets {(a,b), (a,c), (b,c), (b,e)}, {(e,f), (f,g), (e,g), (e,d)}, and {(e,f), (f,g), (e,g), (b,e)}, respectively. Figures 6a and 6b show the overlap graph of this pattern based on the F2 and F3 measures, respectively.
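A small sketch of the two overlap criteria, applied to the three embeddings above (edges written as node pairs; the helper names are ours):

```python
def edges_overlap(s1, s2):
    """F2 criterion: embeddings overlap if they share at least one edge."""
    return bool(s1 & s2)

def nodes_overlap(s1, s2):
    """F3 criterion: embeddings overlap if they share at least one node."""
    nodes = lambda s: {v for e in s for v in e}
    return bool(nodes(s1) & nodes(s2))

# The three embeddings of the pattern in Fig. 1c within the graph of Fig. 1a.
S1 = {("a", "b"), ("a", "c"), ("b", "c"), ("b", "e")}
S2 = {("e", "f"), ("f", "g"), ("e", "g"), ("e", "d")}
S3 = {("e", "f"), ("f", "g"), ("e", "g"), ("b", "e")}

print(edges_overlap(S1, S2))  # False: no common edge, so no overlap under F2
print(nodes_overlap(S1, S2))  # True: they share node e, so they overlap under F3
print(edges_overlap(S2, S3))  # True: they share the edges of the triangle e-f-g
```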
Accelerating our algorithm through efficient filters
Recall that at each iteration, our algorithm generates new subgraphs. For each of these subgraphs, it checks whether the subgraph is isomorphic to one of the patterns constructed up to that iteration. The isomorphism test is a computationally expensive task. Next, we describe how we avoid a large fraction of these tests.
We develop two canonical labeling strategies for patterns. Canonical labeling assigns unique labels to the nodes of a given pattern [27]. If two patterns are isomorphic, then they have the same canonical labeling; the converse, however, is not true. Unlike the isomorphism test, comparing canonical labelings is a trivial task. Following from this observation, when we construct a new subgraph, we first compare its canonical labeling to those of the existing patterns. We then limit the costly isomorphism test to only those patterns which have the same canonical labeling as the new subgraph.
The first canonical labeling counts the degree (i.e., the number of incident edges) of each node in the given pattern. It then sorts those degrees and keeps them as a vector we call the degree vector. If two patterns have different degree vectors, then they are guaranteed to have different topologies. Despite its simplicity, this labeling filters out a large fraction of patterns. To test its efficiency, we applied it to random graphs generated using the Barabási–Albert model [28]. We generated 1000 pairs of graphs, where the two graphs in each pair are non-isomorphic yet have the same number of nodes and edges. The degree vector successfully filtered 85 % of the 1000 pairs.
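A minimal sketch of this first filter, under our own naming; the example pair has the same number of nodes and edges but different degree vectors, so the exact isomorphism test is skipped.

```python
import networkx as nx

def degree_vector(G):
    """Sorted node degrees; identical for isomorphic graphs."""
    return sorted(d for _, d in G.degree())

def maybe_isomorphic(G1, G2):
    """Cheap filter: different degree vectors rule out isomorphism.

    Equal degree vectors do NOT prove isomorphism; an exact test is
    still needed for the surviving candidates.
    """
    return degree_vector(G1) == degree_vector(G2)

# A 4-cycle and a triangle with a pendant edge: same node/edge counts,
# different degree vectors, so the expensive isomorphism test is avoided.
C4 = nx.cycle_graph(4)
paw = nx.Graph([(0, 1), (1, 2), (0, 2), (2, 3)])
print(maybe_isomorphic(C4, paw))   # False -> definitely not isomorphic
```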
The second canonical labeling extends the first one. It was first introduced in [29]. Consider a pattern P=(V, E). Let us define the distance between two nodes \(v_i, v_j \in V\) as the number of edges on the shortest path that connects \(v_i\) and \(v_j\), and denote it with \(x_{ij}\). Let us define the diameter of P as the maximum distance between any two nodes, and denote it with x. Using this notation, we assign to node \(v_i\) the label \(\sum _{j \in V} 2^{x-x_{ij}-d(v_{j})}\). Once we compute the labels of all the nodes in the given pattern, we sort them. We call the resulting vector the nodes vector. Similar to the first labeling above, two isomorphic graphs are guaranteed to yield the same nodes vector. We compute and compare the nodes vector only for the patterns which cannot be eliminated using the first canonical labeling. We then subject the patterns with identical canonical labels to the graph isomorphism test.
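The sketch below computes the nodes vector as described, assuming the sum ranges over all nodes of the pattern (including \(v_i\) itself); the function names and the example pair are ours. The two example trees share the same degree vector, so only the second labeling separates them.

```python
import networkx as nx

def degree_vector(P):
    return sorted(d for _, d in P.degree())

def nodes_vector(P):
    """Sorted per-node labels sum_j 2^(x - x_ij - d(v_j)), where x is the
    diameter of P, x_ij the shortest-path distance, and d(v_j) the degree."""
    dist = dict(nx.all_pairs_shortest_path_length(P))
    x = max(max(row.values()) for row in dist.values())   # diameter of P
    labels = [
        sum(2.0 ** (x - dist[vi][vj] - P.degree(vj)) for vj in P.nodes())
        for vi in P.nodes()
    ]
    return sorted(labels)

# Two non-isomorphic trees with identical degree vectors [1, 1, 1, 2, 2, 3].
A = nx.Graph([("c", "a"), ("c", "b1"), ("b1", "b2"), ("c", "d1"), ("d1", "d2")])
B = nx.Graph([(1, 2), (2, 3), (3, 4), (4, 5), (2, 6)])
print(degree_vector(A) == degree_vector(B))   # True  -> first filter cannot decide
print(nodes_vector(A) == nodes_vector(B))     # False -> no isomorphism test needed
```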
Complexity analysis
Here we analyze the complexity of our method. We refer to Algorithm 1 as we discuss the steps of our method. For each step, we explain its complexity. We then combine the complexities of all steps to obtain the overall complexity of our method. These steps are
-
Find all subgraphs isomorphic to each of the four basic patterns (Line 1): In this step, we analyze each of the four basic patterns separately since they have different topologies. For the pattern M1 in Fig. 2a, to get all subgraphs isomorphic to this pattern, we consider all edges connected to each node in the underlying network and select every combination of two edges connected to the same node. Here, we denote the degree of a node with the function d() (e.g., \(d(v_i)\) is the degree of node \(v_i\)). Thus, the complexity of collecting subgraphs that are isomorphic to M1 is \({\sum \nolimits }_{v_{i} \in V} {{d(v_{i})}\choose {2}}\). Similarly, for the pattern M3 in Fig. 2c, we select every combination of three edges connected to the same node in G. Thus, the complexity of constructing subgraphs which are isomorphic to M3 is \({\sum \nolimits }_{v_{i} \in V} {{d(v_{i})}\choose {3}}\). For the pattern M2 in Fig. 2b, we consider each edge \(e_{ij}\) in G with its two endpoints \(v_i\) and \(v_j\). We collect the edges of both nodes. We then select one edge connected to \(v_i\) and one edge connected to \(v_j\) (on the condition that these two edges are connected at their other ends), which together with \(e_{ij}\) form a subgraph isomorphic to M2. Thus, the complexity of constructing subgraphs that are isomorphic to M2 is \({\sum \nolimits }_{e_{ij} \in E} d(v_{i}) d(v_{j})\). Similarly to M2, we perform the same operation to get the subgraphs isomorphic to the pattern M4 in Fig. 2d; only this time we make sure that the two edges incident to \(v_i\) and \(v_j\) are not connected at their other ends. Thus, the complexity of constructing subgraphs that are isomorphic to M4 is also \({\sum \nolimits }_{e_{ij} \in E} d(v_{i}) d(v_{j})\). Collectively, the complexity of performing this step is \(\mathcal {O}({\sum \nolimits }_{v_{i} \in V} d(v_{i})^{3} + {\sum \nolimits }_{e_{ij} \in E} d(v_{i}) d(v_{j}))\). Notice that, theoretically, the worst case occurs when \(d(v_{i}) = \mathcal {O}(n)\); in this scenario, the complexity of this step becomes \(\mathcal {O}(n^{4})\). (A brief code sketch of this enumeration is given after this list.)
-
Extract the maximum disjoint set for the basic patterns (Line 2): In this step, we use the algebraic algorithm described in Section “Finding MIS: Going from F1 to F2” (the second one) to calculate the number of overlaps of each subgraph belonging to each pattern’s equivalence class. This calculation takes constant time per subgraph, and we evaluate the algebraic expressions as we construct the subgraphs in the previous step. We then sort the subgraphs within each equivalence class in decreasing order of their number of overlaps. This sort has complexity \(\mathcal {O}(m \, log(m))\), where m is the number of subgraphs in the equivalence class. Recall from the previous step that this number is \(\mathcal {O}\left ({\sum \nolimits }_{v_{i} \in V} d(v_{i})^{3} + {\sum \nolimits }_{e_{ij} \in E} d(v_{i}) d(v_{j})\right)\). Thus, the complexity of this step is \(\mathcal {O}\left (\left ({\sum \nolimits }_{v_{i} \in V} d(v_{i})^{3}\right) log\left ({\sum \nolimits }_{v_{i} \in V} d(v_{i})^{3}\right) + \left ({\sum \nolimits }_{e_{ij} \in E} d(v_{i}) d(v_{j})\right) log\left ({\sum \nolimits }_{e_{ij} \in E} d(v_{i}) d(v_{j})\right)\right)\).
-
Join Iterations (Lines 5–27): In this step, we analyze the complexity of one join iteration and then sum over all join iterations. Let us denote the number of current patterns in iteration i with \(x_i\). Notice that for the first iteration, \(x_1 = 4\). Recall that in each join iteration, we increase the size of each of the current patterns by one or two edges. In addition, the patterns of the first join iteration have at least two edges. Thus, the size (i.e., number of edges) of each of the current patterns in iteration i is at least i+2. The number of subgraphs isomorphic to each of the current patterns is at most \(\frac {|E|}{i+2}\) since they are non-overlapping subgraphs. Recall that the subgraphs of the basic patterns are non-overlapping within each pattern. Thus, the numbers of subgraphs of the patterns M1, M2, M3, and M4 are at most \(\frac {|E|}{2}\), \(\frac {|E|}{3}\), \(\frac {|E|}{3}\), and \(\frac {|E|}{3}\), respectively. Collectively, the number of subgraphs of the basic patterns is \(\mathcal {O} (|E|)\).
In the join iteration, we start by joining subgraphs of current patterns with the subgraphs of the basic patterns (Lines 6–9). Thus, the total number of joins we perform at iteration i is \(\mathcal {O}\left (|E| \frac {|E|}{i+2} x_{i}\right)\). For each join, we compare the resulting subgraph against all patterns (Line 10). Recall that we use filters to avoid this costly isomorphism check (see Section “Accelerating our algorithm through efficient filters”). Thus, the complexity of this operation is \(\mathcal {O} (x_{i})\). If this subgraph is isomorphic to one of the current patterns, we check whether it is a duplicate of one of the subgraphs which already exists in this equivalence class (Line 11). We search an indexed list of those subgraphs in \(\mathcal {O} \left (log\left (\frac {|E|}{i+2}\right)\right)\). Collectively, we obtain the complexity of performing all joins at iteration i by multiplying the three complexities above, yielding \(\mathcal {O}\left (|E| \frac {|E|}{i+2} x_{i} \cdot x_{i} \cdot log\left (\frac {|E|}{i+2}\right)\right)\), which equals \(\mathcal {O}\left ({x_{i}^{2}} \frac {|E|^{2}}{i+2} log\left (\frac {|E|}{i+2}\right)\right)\).
Upon completing all join operations, our algorithm extracts the MIS for each pattern (Line 18) using the overlap graph algorithm described in Section “Finding MIS: Going from F1 to F2” (the first one). Notice that we perform this operation for the new set of \(x_{i+1}\) patterns (the current patterns of the next iteration), for each of which the number of subgraphs is at most \(\frac {|E|}{i+3}\) (this is because each subgraph has i+3 edges and no two of them overlap). For each pattern, we collect the overlapping subgraphs of each subgraph in \(\mathcal {O}\left (\left (\frac {|E|}{i+3}\right)^{2}\right)\) time. We then sort the subgraphs in decreasing order of their number of overlaps in \(\mathcal {O}\left (\frac {|E|}{i+3} log\left (\frac {|E|}{i+3}\right)\right)\) time. Thus, we extract the MIS for all patterns in \(\mathcal {O}\left (x_{i+1} \left (\frac {|E|}{i+3}\right)^{3} log\left (\frac {|E|}{i+3}\right)\right)\) time.
Finally, we check each resulting pattern (Lines 19–25) and delete it if its frequency is less than the threshold α. We perform this step in \(\mathcal {O}(x_{i+1})\) time.
Recall that in each join iteration, we increase the size of each of the current patterns by one or two edges. Also recall that we start with patterns that have at least two edges. Thus, the total number of join iterations we perform until all patterns reach at least the target motif size is μ−2. Thus, the complexity of all join iterations is \(\mathcal {O}\left (\sum \limits _{i=1}^{\mu - 2} \left ({x_{i}^{2}} \frac {|E|^{2}}{i+2} log\left (\frac {|E|}{i+2}\right) + x_{i+1} \left (\frac {|E|}{i+3}\right)^{3} log\left (\frac {|E|}{i+3}\right)+ x_{i+1}\right)\right)\), or simply \(\mathcal {O}\left (\sum \limits _{i=1}^{\mu - 2} \left [x_{i} \frac {|E|^{2}}{i} log\left (\frac {|E|}{i+2}\right)\right ] \left [x_{i} + \frac {|E|}{i^{2}}\right ] + x_{i+1}\right)\).
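The following sketch illustrates the enumeration referenced in the first step of the list above, assuming (as Fig. 2 and the proof suggest) that M1 is the two-edge path and M2 the triangle; the function names are ours, and duplicates are removed here for clarity even though the counting argument tallies raw combinations.

```python
from itertools import combinations
import networkx as nx

def enumerate_m1(G):
    """Embeddings of M1 (the two-edge path): every pair of edges sharing a
    node, mirroring the count sum_v C(d(v), 2)."""
    for v in G.nodes():
        for u, w in combinations(G.neighbors(v), 2):
            yield frozenset({frozenset({u, v}), frozenset({v, w})})

def enumerate_m2(G):
    """Embeddings of M2 (the triangle): for each edge (vi, vj), pick a
    neighbor of vi and one of vj that coincide, mirroring the count
    sum_{(vi,vj) in E} d(vi) d(vj); dropping the closure condition instead
    enumerates M4 (the three-edge path)."""
    seen = set()
    for vi, vj in G.edges():
        for w in set(G.neighbors(vi)) & set(G.neighbors(vj)):
            tri = frozenset({vi, vj, w})
            if tri not in seen:
                seen.add(tri)
                yield tri

G = nx.Graph([("a", "b"), ("a", "c"), ("b", "c"), ("b", "e")])
print(len(list(enumerate_m1(G))))   # 5 two-edge paths
print(len(list(enumerate_m2(G))))   # 1 triangle: {a, b, c}
```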
In summary, the complexity of our method considering all the previous steps is
$$\begin{array}{*{20}l} & \mathcal{O}\left(\left({\sum\nolimits}_{v_{i} \in V} d(v_{i})^{3}\right) \left(1+ log\left({\sum\nolimits}_{v_{i} \in V} d(v_{i})^{3}\right)\right)\right. \\ &\quad+\! \left({\sum\nolimits}_{e_{ij} \in E} d(v_{i}) d(v_{j})\right) \left(1+ log\left({\sum\nolimits}_{e_{ij} \in E} d(v_{i}) d(v_{j})\right)\right) \\ &\quad\left. + \sum\limits_{i=1}^{\mu - 2} \left(\left[x_{i} \frac{|E|^{2}}{i} log\left(\frac{|E|}{i+2}\right)\right] \left[x_{i} + \frac{|E|}{i^{2}}\right] + x_{i+1}\right)\right) \end{array} $$
Notice that \(x_i\) here depends significantly on the topology and the density of the given network G. To the best of our knowledge, there is no closed formula that calculates \(x_i\) (i.e., the number of unique topologies of a certain size in a given graph G).