ProMotE: an efficient algorithm for counting independent motifs in uncertain network topologies

Background Identifying motifs in biological networks is essential in uncovering key functions served by these networks. Finding non-overlapping motif instances is however a computationally challenging task. The fact that biological interactions are uncertain events further complicates the problem, as it makes the existence of an embedding of a given motif an uncertain event as well. Results In this paper, we develop a novel method, ProMotE (Probabilistic Motif Embedding), to count non-overlapping embeddings of a given motif in probabilistic networks. We utilize a polynomial model to capture the uncertainty. We develop three strategies to scale our algorithm to large networks. Conclusions Our experiments demonstrate that our method scales to large networks in practical time with high accuracy where existing methods fail. Moreover, our experiments on cancer and degenerative disease networks show that our method helps in uncovering key functional characteristics of biological networks.

probabilistic network. Otherwise, it is a deterministic network. In the rest of the paper, we represent a probabilistic network using a graph denoted with G = (V , E, P), where V denotes the set of interacting molecules, E denotes the interactions among them, and P : E → (0, 1] is the function that assigns a probability value to each edge. Several approaches have been developed to solve the network motif discovery problem (e.g., [17][18][19]). However, most of them focus on deterministic network topologies. The main reason behind this limitation is that a probabilistic network summarizes all deterministic networks generated by all possible subsets of interactions. Thus, a probabilistic network G = (V , E, P) yields 2 |E| deterministic instances. The exponential growth of the number of deterministic instances makes it impossible to directly apply existing solutions to probabilistic networks. Relatively little research has been done on finding motifs in probabilistic networks. Tran et al. [20] proposed a method to derive a set of formulas for count estimation. This study however has not provided a general mathematical formulation for arbitrary motif topologies. It rather requires a unique mathematical formulation for each motif. Besides, it assumes that all interactions of the probabilistic network have the same probability. Thus, it fails to solve the generalized version of the problem where each interaction takes place with a possibly different probability. Todor et al. [21] developed a method to solve the generalized version of the problem. It computes the exact mean and variance of the number of motif instances. Both of above two methods count the maximum number of motif instances using F 1 measure, that is including all possible embeddings regardless of whether they overlap with each other or not.
There are two more restrictive frequency measures, F 2 and F 3 , which avoid reuse of graph elements [19]. F 2 measure considers that two embeddings of a motif overlap if they share an edge. F 3 measure is more restrictive as it defines overlap as sharing of a node. These two measures count the maximum number of non-overlapping embeddings of a given motif. We explain the difference among three frequency measures on a hypothetical deterministic network G o (see Fig. 1a). Consider the motif pattern M in  Fig. 1c-h). Since F 1 measure counts all possible embeddings, the F 1 count is six. As embeddings H 1 and H 6 do not have common edges, the F 2 count is two. All pairs of embeddings in this set share nodes. As a result, the F 3 count is one. F 2 and F 3 measures satisfy a fundamental characteristic, the downward closure property, which F 1 measure fails to have. This property is essential for constructing large motifs [22]. It ensures that the frequency of network motifs is monotonically decreasing with increasing size of motif patterns. For example, in the deterministic network G o (see Fig. 1a), given the triangle pattern (see Fig. 1i), there are two triangle embeddings in total (Fig. 1j). Consider a larger motif pattern, such as the pattern in Fig. 1b. The F 1 count however becomes six, which conflicts with the downward closure property. Besides, non-overlapping motifs are needed in navigation methods such as folding and unfolding of the network [23]. the importance of non-overlapping motifs into account, Sarkar et al. [24] developed a method to count the nonoverlapping motifs in probabilistic networks using the F 2 measure. Their study builds a polynomial to model the distribution of the number of motif instances overlapping with a specific embedding of that motif. However, the exponential growth of the size of polynomial terms makes it not scalable to large networks.

Contributions.
In this paper, we develop a scalable method, named ProMotE (Probabilistic Motif Embedding), to tackle the problem of counting independent motifs in a given probabilistic network. We formally define the problem in "Preliminaries and problem definition" section. We explain our method for the F 2 measure, yet the same algorithm can trivially be applied to the F 3 measure. This study has three major contributions over the existing literature: (1) The key bottleneck in counting motifs in probabilistic networks is computing the distribution of the number of overlapping embeddings of a given motif instance. We build a new method which allows us to avoid computing this distribution whenever possible. (2) Computing the distribution in (1) above necessitates constructing a polynomial. We devise two strategies, which compute bounds to the overlapping motif count distribution prior to constructing the entire polynomial. These bounds enable us to terminate the costly computation of the distribution whenever possible.
(3) We develop a new strategy which allows multiplication of arbitrarily large polynomials using a limited amount of memory. Our experimental results demonstrate that our algorithm is orders of magnitude faster than existing methods. Our results on cancer and disease networks suggest that our method can help in uncovering key functional characteristics of the genes participating in those networks.
We organize the rest of the paper as follows. We present our algorithm in "Methods" section. We discuss our experimental results in "Results and discussion" section and conclude in "Conclusions" section.

Methods
In this section, we present our method, ProMotE. First, we formally define the independent motif counting problem in probabilistic networks ("Preliminaries and problem definition" section). We next summarize the method by Sarkar et al. [24] ("Overview of the existing solution" section ). We then present the method developed in this paper. Our method introduces three strategies (Sections "Avoiding loss computation", "Efficient polynomial collapsation" and "Overcoming memory bottleneck"), which help us scale to large network size, for which existing methods fail.

Preliminaries and problem definition
In this section we present basic notation needed to define the problem considered in this paper. We denote the given probabilistic network and motif pattern with G = (V , E, P) and M respectively. For each edge e i in G, we denote the probability that e i is present and absent with p i and q i respectively (i.e., p i + q i = 1). We denote the set of all possible deterministic network topologies one can We denote a specific deterministic network which inherits all nodes and edges from G but assume that all of its edges exist with G = (V , E). Figure 2 depicts a probabilistic network and its three possible deterministic networks (i.e., in total there are 2 8 = 256 deterministic networks). We denote the probability of observing a specific deterministic network G o ∈ D(G) with Given a deterministic network G o = (V , E o ) and a motif pattern M, we represent the set of all its embeddings with H(M|G o ). We construct the overlap graph for H(M|G o ), denoted withḠ o , by representing each embedding H k ∈ H(M|G o ) as a node and inserting an edge into two nodes if their corresponding embeddings share at least one edge. Thus, for a specific embedding H k , the degree of its corresponding node inḠ o equals the number of embeddings overlapping with H k . Figure 3 depicts the overlap graph of the embeddings found in deterministic network G o shown in Fig. 1 Consider a specific embedding H k in G. Because of the uncertain nature of the probabilistic network, each embedding exists with a probability value. As a result, the number of embeddings overlapping with H k is also uncertain. We represent it using a random variable B k . To calculate the distribution of B k , we construct a bipartite graph denoted with G k = (V 1 , V 2 , E). V 1 and V 2 represent two node sets, and E represents the edges connecting nodes of V 1 with those of V 2 . Each neighboring node of H k in the overlap graph corresponds to a node in V 1 . Each edge in the edge set, which constitutes all those overlapping embeddings of H k , corresponds to a node in V 2 . Notice that this edge set excludes the edges of embedding H k itself. An edge exists between nodes u ∈ V 1 and v ∈ V 2 if the corresponding embedding of node u has the edge denoted by v. Figure 4 shows the bipartite graph G 4 of embedding H 4 in G o (see Fig. 1 To help better understand this paper, we introduce another two notations x-polynomial and collapse operator. Given a bipartite graph G k , we compute a polynomial, called the x-polynomial as follows. For each node v i ∈ V 1 , it defines a unique variable x i . For each node v j ∈ V 2 , the probability that v j 's corresponding edge is present and absent is p j and q j (q j = 1 − p j ) respectively. For each node v j ∈ V 2 , we construct a polynomial called edge polynomial Z j as Fig. 4 The bipartite graph G 4 of the embedding H 4 . Each x i denotes the variable for each node in V 1 . Each Z j represents the edge polynomial for each node in V 2 The first term of this edge polynomial consists of the product of the variables of those overlapping embeddings containing this edge. The second term only has the probability of the absence of this edge. We explain the concept of edge polynomial using the example of the bipartite graph in Fig. 4. In this example, the edge polynomial for edge e 1 is Z 1 = p 1 x 1 + q 1 . Also the edge polynomial corresponding to e 2 is Z 2 = p 2 x 1 x 2 x 3 + q 2 . The first term of this edge polynomial represents the case that when edge e 2 is present, it contributes to the existence of embeddings H 1 , H 2 and H 3 with a probability p 2 . The second term however represents the case that when edge e 2 is absent with probability q 2 , none of those three embeddings exist. We compute the x-polynomial of H k denoted with Z H k as The key characteristic of the x-polynomial in the above equation is that its terms model all possible deterministic network topologies for the edges denoted by V 2 . We write the jth term of the x-polynomial as α j v i ∈V 1 x c ij i , where α j is the probability and c ij is the exponent of the variable x i . To compute this polynomial faster, we introduces a collapse operator for each variable x r denoted with φ r (), as follows. Let us denote the degree of Using these notations, for the jth term of the x-polynomial, we compute collapse operator φ r () as Notice that, the collapse operator φ r only changes the variable x r . It either replaces it with t or completely removes it depending on the outcome of ψ r (). When ψ r () = 1 (i.e., c rj = deg(v r |G k )), it means that all edges of embedding H r are present (e.g., H r exists). Thus, the variable t replaces x r which means a motif is present. When ψ r () = 0 , it indicates that at least one edge of H r is absent. Thus, the entire H r is missing. For example, consider one of the terms resulting from the product of all edge polynomials in Z H 4 , q 1 p 2 p 4 q 8 x 2 1 x 2 2 x 2 3 x 5 . If we apply the collapse operator φ 1 () to this term, the variable x 1 will be removed as ψ 1 () = 0 (deg(H 1 |G 4 ) = 3 while the exponent of x 1 in this term is 2). Similarly, if we apply the collapse operator φ 2 () to this term, the variable x 2 will be replaced with t as ψ 2 () = 1 (deg(H 2 |G 4 ) = 2 and the exponent of x 2 in this term is also 2). After applying all collapse operators to this term, it becomes q 1 p 2 p 4 q 8 t 3 which indicates that when only edges e 2 and e 4 are present, there are three embeddings present. And this case happens with a probability q 1 p 2 p 4 q 8 . We apply the collapse operator ψ r to the polynomial terms as soon as it completes multiplication of the final edge polynomial of the variable x r , which means that no other edge polynomial can increase the exponent of x r .
Given these definitions, we formally define two different independent motif counting problems next.

Definition 1 (INDEPENDENT MOTIF COUNTING IN PROBABILISTIC NETWORK I).
Given a probabilistic network G = (V , E, P) and a motif pattern M, find a set of independent embeddings which yields the maximum expected number of occurrences in G, which is We explain the problem on a hypothetical probabilistic network G(see Fig. 2). To better explain the problem, we also list some possible deterministic networks in Fig. 2. Notice that this probabilistic network has the same network topology as the deterministic network G o in Fig. 1a. As a result, G has six possible embeddings same with G o , which are H 1 , H 2 , H 3 , H 4 , H 5 and H 6 (see Fig. 1c-h). According to the problem definition, we seek to find a set of non-overlapping embeddings which contributes to the maximum expected number of motif count over all possible deterministic network topologies. For those six embeddings of G, we are able to construct five sets of independent embeddings, which are {H 1 , H 6 }, {H 2 }, {H 3 }, {H 4 } and H 5 (see Fig. 3 for the relationship between embeddings). For each set, we summarize the expected motif count over the set of all alternative deterministic network topologies based on Eq. 4. Table 1 lists the result. Then, we choose the set with maximum motif count. Notice that, the resulting embedding set with the maximum expected motif count is not guaranteed to always have the largest motif frequency among all possible deterministic networks. For example, in deterministic network G o 1 , the set {H 1 , H 6 } has the highest motif frequency; while in network G o 3 , it is the set {H 2 } achieves the largest motif count. By requiring to select the set of embeddings with highest frequency in each possible deterministic network, we have our second independent motif counting problem. We formally define it next.  Fig. 1) in network G (Fig. 2). The table shows the number of embeddings occurring at each deterministic network for each independent embedding set and its expected value in G

Definition 2 (INDEPENDENT MOTIF COUNTING IN PROBABILISTIC NETWORK II). Given a probabilistic network G = (V , E, P) and a motif pattern M, compute the
Expected motif count

expected number of maximum independent occurrences of M in G, which is
Notice that in this problem, we are required to always select the largest independent embedding set in each possible deterministic network topology. We compute the expected number of independent motif by iterating over all possible deterministic networks and summing up the motif count. For example, in the example network (Fig. 2), the expected independent motif count is calculated by . The former definition of the independent motif counting problem above (Definition 1) seeks the genes, which are more likely to carry out the function characterized by the given motif across all possible deterministic topologies. The latter definition (Definition 2) does not care about the identity of the set of genes engaged in the process as the set of genes vary depending on the deterministic network topology observed. It instead counts the number of different ways we can observe the process separately for each topology even though that set may differ from one topology to another. In this paper, we focus on the first problem. The rationale is that we often do not know the specific deterministic topology realized at a given point in time. Furthermore, this topology can vary over time. Notice that this problem can be solved by enumerating all possible deterministic network topologies and independent embedding sets. However, it is infeasible to scale to large networks as the numbers of deterministic network topologies and independent embedding sets grow exponentially. In this paper, we develop a scalable method to tackle this problem by utilizing a polynomial model and three strategies. We discuss this polynomial model and three strategies next.

Overview of the existing solution
Here, we briefly describe the method by Sarkar et al. [24] for counting independent motif instances, as our method utilizes the same polynomial model in that study. Given a probabilistic graph G = (V , E, P) and the specified motif pattern M, the algorithm works in three steps. First, it discovers all motif embeddings in the deterministic network G = (V , E). It then builds an overlap graph for these embeddings. Next, it uses a heuristic strategy to count non-overlapping motif embeddings; it calculates a priority value for each node (we explain how to compute priority value below) and iteratively picks the node with the highest priority in the overlap graph. It includes the corresponding embedding to the result set, adds the probability that this embedding exists to the motif count and removes this node along with all of its neighbouring nodes from the overlap graph. It repeats this process until the graph is empty.
The key step of this method is calculating the priority value for each node in the overlap graph. The priority value of a node primarily depends on the number of neighbours of a node. In a probabilistic networks, both the existences of an embedding and its overlapping embeddings are uncertain as the edges which make up those embeddings are probabilistic. To accurately model this uncertainty, for each embedding H k , it first calculates a gain value a k , which equals to the probability that H k exists a k = e∈H k P(e) . Then it computes a loss value using the number of neighbours of H k which is represented with a random variable B k . It then computes the loss value of H k as a function of B k , denoted with f (B k ). Finally, it determines the priority value, denoted with ρ k , as a function of gain value and loss value. In this paper, we compute ρ k as a k /f (B k ).
Sarkar et al. compute the distribution of B k using a x-polynomial. To construct this x-polynomial, it first builds an undirected bipartite graph denoted with G k = (V 1 , V 2 , E). Then for each node v j ∈ V 2 , it constructs an edge polynomial Z j . After multiplying all edge polynomials and collapsing it, the x-polynomial takes the form The coefficients of the polynomial Z H k is the true distribution of the random variable B k (i.e., ∀j, the coefficient of t j is the probability that B k = j). For any further information, we refer the interested readers to [24].

Avoiding loss computation
Recall that, we calculate the distribution of B k for all nodes of the overlap graph only to select the one that yields the highest priority value ρ k () (see "Overview of the existing solution section"). Here, we develop a method to quickly compute an upper bound to ρ k . This allows us to avoid computation of the distribution of B k for the node v k when the upper bound to ρ k is less than ρ j for any node v j considered prior to v k . To explain this strategy, we first present our theory which establishes the foundation of the upper bound computation. We start by defining our notation.
Consider G k = (V 1 , V 2 , E) of an embedding H k . For a given subset V 2 ⊆ V 2 , let us denote the x-polynomial of H k after multiplying the edge polynomials of node set V 2 with Z H k ,V 2 . Below, we discuss our theory using a lemma, a theorem, and a corollary.

Lemma 1 Consider the bipartite graph of motif
We first discuss how to compute the probability that exactly τ neighboring embeddings of H k exist. After multiplying edge polynomials and collapsing, Z H k ,V 2 takes the following form: Here, l α jl , which sums up all the coefficients of the polynomial terms containing t j , equals to the probability that exactly j neighboring embeddings of H k exist after multiplying the edge polynomials of V 2 . Next, we focus on one polynomial term from the above x-polynomial. Let us denote this polynomial term as Then after multiplying one more edge polynomial, say Z r = p r Two cases may happen after the collapsing of the polynomial terms B and C.
Case 1: There is no collapse. The exponent of the variable t of polynomial terms B and C remains the same. Adding up the coefficients of term t j , we get Thus, after multiplying another edge polynomial, the coefficient of term t j remains the same. In other words, multiplying another edge polynomial has no effect on Case 2: There is collapse. In this case, the exponent of the variable t of polynomial term B will increase while it stays the same for polynomial term C, since multiplying the second term of Z r does not introduce any x variable. Let us denote the increment in the exponent of t (i.e., the number of x i variables which collapse after multiplying Z r ) with j 0 . Now the polynomial terms B and C become x c i i respectively. How this multiplication affects P(B k ≥ τ ) depends on the relationship between j and τ . We have two cases: Case 2.a When j < τ, polynomial term A does not contribute to P(B k ≥ τ ) before multiplying Z r . After multiplying Z r , polynomial term C also does not contribute to P(B k ≥ τ ). Whether polynomial term B contributes to P(B k ≥ τ ) depends on the relationship between j + j 0 and τ . If j + j 0 ≥ τ , the probability that j + j 0 neighboring embeddings of H k exist grows. Thus, based on the Eq. 7, P(B k ≥ τ ) increases by p r α (i.e., the coefficient of t (j+j 0 ) ). On the other hand, if j + j 0 < τ, polynomial term B has no effect on P(B k ≥ τ ). In conclusion, after multiplying one more edge polynomial, the value of P(B k ≥ τ ) either increases or remains the same. Mathematically, When j ≥ τ , the polynomial term A contributes to P(B k ≥ τ ). From Eq. 7, before multiplying Z r , the amount of contribution of polynomial term A to P(B k ≥ τ ) is α. After multiplying Z r , the amount of contribution is equal to the sum of the coefficients of the polynomial terms B and C, where is αp The above lemma leads to the following theorem: Theorem 1 Consider a motif embedding H k and its corresponding bipartite graph G k = (V 1 , V 2 , E). Also consider a subset V 2 ⊂ V 2 . Given a monotonic function γ () : R → R such that γ (0) = 0 and for ∀x ≥ y ≥ 0, γ (x) ≥ γ (y) ≥ 0. ∀v r ∈ V 2 − V 2 , we have Proof From the monotonicity of γ () function, for ∀j ≥ 1, we have From Lemma 1, given V 2 and v r ∈ V 2 − V 2 , for ∀j ≥ 0, we have For ∀j ≥ 1, by multiplying both sides of the inequality with (γ (j) − γ (j − 1)), we get Thus, summing up this inequality ∀j ≤ |V 1 |, we get We rewrite the left side of this inequality as and γ (0) = 0, we rewrite Eq. 9 as Similarly, we rewrite the right side of Inequality (8) as Using the above equations, We rewrite the Inequality (8) as As γ (0) = 0, using the above inequality, we get This theorem gives us a general form of f (B k ) function which is monotonically increasing. For example, the expected value of B k , Exp(B k ) falls into that category. Corollary below proves it: Corollary 1 Given V 2 and v r ∈ V 2 − V 2 , the expected number of neighboring embeddings of H k monotonically increases with growing edge polynomial set: Proof The expected value of B k can be computed as We have γ (j) = j which is a monotonical function. Thus, from Theorem 1, we have Using Theorem 1, we develop our method for avoiding the costly computation of the distribution of B k for each embedding H k of the given motif in the target network. Our method works for all monotonic loss functions (e.g., f (B k ) = Exp(B k )). Assume that, ∃k > 1, ∀i 1 ≤ i < k, we already computed the values a i , distribution of B i , f (B i ), and thus ρ i . Let us denote the largest observed priority value so far with ρ = max 1≤i<k {ρ i }. We explain next how we use this information to avoid computation of the distribution of B k whenever possible. Let us denote the bipartite graph of H k with G k = (V 1 , V 2 , E). Our algorithm iteratively multiplies the edge polynomials for all the nodes in V 2 one by one and collapses the resulting polynomial. Let us denote the set of nodes in V 2 with V 2 = v 1 , v 2 , . . . , v |V 2 | . Without losing generality, let us assume that we multiply the edge polynomials in the order v 1  (i.e., the priority value monotonically decreases if loss value monotonically increases). Following from this theorem, we terminate computation of B k as soon as ρ j k becomes less than the best priority value observed, ρ . This eliminates costly polynomial multiplication for H k .
When to stop the calculation of B k largely depends on the best priority value ρ observed so far. The larger ρ is, the sooner we terminate the computation of B k . Thus, the ideal ordering places embeddings with larger priority values should earlier. The dilemma here is that we do not know the priority values of the embeddings at this stage. Therefore, we use a proxy value of each embedding H k , denoted with Q k , which is trivial to compute, a k (i.e., the gain value of H k ) divided by the number of overlapping embeddings with H k . We rank embeddings in descending order of their Q k values. The rationale behind using the value Q k is as follows. Recall that the priority value of H k is determined by the gain value a k and the loss value, which largely depends on the distribution of its neighboring nodes. Q k conjectures that the larger the degree of the corresponding node of H k is, the larger its loss value is. Thus, Q k is inversely proportional to the number of embeddings, which conflict with H k .

Efficient polynomial collapsation
Collapsation plays an important role in calculating the distribution of B k of the embedding H k efficiently. The sooner we collapse the polynomial terms, the earlier we compute an upper bound to ρ k . Here, we introduce two orthogonal strategies to ensure early collapsation during the construction of the x-polynomial. We describe our strategies on the bipartite graph G k = (V 1 , V 2 , E) of H k . Our first strategy focuses on V 1 . The second one focuses on V 2 .
Optimization on V 1 In order to collapse an xpolynomial term, which contains variable x i , we need to multiply all the edge polynomials containing x i (see Eq. 3). The degree of the node v i ∈ V 1 , deg(v i |G k ) is equal to the number of edge polynomials that the variable x i needs to be collapsed. Consider a node v i ∈ V 1 with deg(v i |G k ) = 1. Suppose that ∃v j ∈ V 2 , (v i , v j ) ∈ E. We collapse the variable x i as soon as the edge polynomial Z j has been multiplied into the x-polynomial. In this case, the collapse operator φ i will replace the variable x i in xpolynomial with t. Following from this observation, our first strategy works as follows. Consider a node v j ∈ V 2 . Let us denote the set of all nodes v i ∈ V 1 for which deg(v i |G k ) = 1 and (v i , v j ) ∈ E with V 1,j . We rewrite Eq. 1 (see Section "Preliminaries and problem definition") as The above equation means that before we do any polynomial multiplication, we first apply the collapse operator φ i (∀i, such that v i ∈ V 1,j ) to Z j . This preemptive collapsation prevents the exponential growth in the number of collapsation operations for those v i satisfying the conditions above. For example, in the example bipartite graph (see Fig. 4), for edge e 8 , its original edge polynomial Z 8 = p 8 x 6 +q 8 can be rewritten as Z 8 = p 8 t +q 8 to avoid applying the collapse operation φ 6 () in any further polynomial multiplication.
Optimization on V 2 The order in which edge polynomials are multiplied has a great effect on the cost of polynomial multiplication. Recall from "Preliminaries and problem definition section" that each variable x r collapses only after the multiplication of the final edge polynomial for x r has been completed. Following from this observation, our second strategy conjectures that increasing the number of collapsing variables after the product of a given number of edge polynomials reduces both the running time and the amount of memory needed to store the x-polynomial. We explain this on the bipartite graph in Fig. 4. In our example, to simplify our notation, we will only consider the optimization on V 2 and ignore the impact of the optimization on V 1 described above. We have four edge polynomials in total. If we multiply four edge polynomials in the order of Z 1 , Z 2 , Z 4 , Z 8 , no collapsation takes place until we multiply Z 4 . This is because Z 4 is the final edge polynomial for x 1 , x 2 , x 3 and x 5 . As a result, this ordering requires in total 32 collapses (i.e., number of polynomial terms is 2 3 = 8, and we collapse each of x 1 , x 2 , x 3 and x 5 resulting in 8 + 8 + 8 + 8 operations). Adding the collapsation cost of the last edge polynomial, 2 4 = 16, we achieve 32 + 16 = 48 operations in total. Now, let us analyse the cost of the same product when we multiply the edge polynomials in the order of (Z 4 , Z 2 , Z 8 , Z 1 ). In this ordering, Z 4 is the final edge polynomial for x 5 . Thus, we need another 2 1 = 2 operations for variable x 5 . The following edge polynomial Z 2 is the final edge polynomial for variables x 2 and x 3 . Therefore, once we multiply Z 2 , we can collapse variables x 2 and x 3 . Thus, for variables x 2 and x 3 , only 8 collapses are needed (i.e., there are 4 polynomial terms and 2 collapse operators). Variables x 6 and x 1 collapses after the product of Z 8 and Z 1 leading to eight and sixteen more collapse operations respectively. In total, this ordering yields only 34 (i.e., 2+8+8+16) collapses. By reordering the edge polynomials, we reduce not only the time for collapsation, but also the memory space for storing the variables. Furthermore, as we explain below, an effective ordering has potential to avoid the loss computation without losing the accuracy of the result. Next, we formally define the problem of ordering of the edge polynomials.

Definition 3 ORDERING OF THE EDGE POLYNOMIALS .
Assume a bipartite graph G k = (V 1 , V 2 , E) and a specified loss value denoted by are given. Each node v i ∈ V 1 has a unique variable x i and a collapse operator φ i . Each node v j ∈ V 2 has an edge polynomial Z j . For each collapse operator φ i , let us denote the number of the polynomial terms it has been applied to with N φ i . Let us denote a permutation of the integers in the [ 1 : |V 2 |] interval with π =[ π 1 , π 2 , . . . , π |V 2 | ]. Our problem is to find the ordering π, for which ∃r, such that after multiplying the first r polynomials in the order of π, we have f (B k ) ≥ and r|V 1 |2 r + Notice that in the definition above, we aim to minimize the number of edge polynomials (i.e., variable r). Also if there are multiple orderings with the same number of edge polynomials, we prefer the one that requires the least collapsation operations (i.e., One straightforward method to solve this problem is to calculate the collapsation cost for all possible orderings of the edge polynomials and choose the one with the smallest cost. This, however, is infeasible as there are |V 2 |! alternative orderings. Here, we develop a greedy iterative algorithm to quickly estimate an ordering. Briefly, at each iteration, our algorithm chooses the edge which contributes to the collapsation of most variables. We explain our algorithm using the bipartite graph shown in Fig. 4.
Our algorithm maintains two matrices. We denote these two matrices at the ith iteration with W i and D i . At each stage, we update the W i matrix and generate a D i matrix based on W i . We will explain these two matrices in detail later. We choose a suitable edge polynomial using the D i matrix, and repeat this process until the last edge polynomial.
We first explain the two matrices above. The first matrix denoted by W i maintains the relationship between the x variables and edge polynomials. Let Figure 5 presents matrix W 1 corresponding to the bipartite graph in Fig. 4. Here, Z 2 and Z 4 contain x 2 . As a result W 1 [ 2,2] We construct matrix D i from W i . This matrix counts different levels of contributions of each edge polynomial. It has |V 2 | rows and max u∈V 1 {deg(u|G i k )} columns. We set the entry D i [ r, s] to the product of the number of entries in the rth column of W i having the value 1/s and the edge probability in Z r . For example, in the matrix W 1 in Fig. 5, Z 2 (i.e., column 2) contributes two variables at value 1/2 and one variable 1/3. As a result, we set the second row of At the ith iteration, using the matrix D i , we choose the next edge polynomial for multiplication as follows. We start by looking at the first column of D i , and choose the row (i.e., edge polynomial) with the largest value. If there are multiple such rows, we use the second column of D i among those polynomials. We repeat this process until we find such a row with the largest value. If there are still more than one rows after reaching the last column of D i , we randomly choose the edge polynomial corresponding to one of them. For example, in Fig. 5, both Z 4 and Z 8 has values in the first column. We choose one of them depending on their edge probability. If they have the same edge probability (i.e., p 4 = p 8 ), we look at their values in the second column, where Z 4 should be selected.
At the ith iteration, assume that we pick the edge polynomial Z r . We update G i k for the next iteration by removing the node v r from G i k along with all of its incident edges.

Overcoming memory bottleneck
The number of terms of the x-polynomial can grow exponentially, particularly when collapsation does not take place. This quickly leads to memory bottleneck especially for dense overlap graphs. In this section, we present a recursive strategy to overcome this bottleneck.
The main idea behind this strategy is as follows. Given a new edge polynomial, we multiply it with only a subset of the current x-polynomial terms, while deferring others. After completing the multiplication of all edge polynomials, we multiply the deferred polynomial terms using last-in, first-out policy. Figure 6 depicts this idea. Here, the shaded bar represents the subset of terms of the current x-polynomial to be multiplied with the next edge polynomial (e.g., Z i+1 ). Let us denote the number of terms in this subset with N 1 and the number of deferred terms  with N 2 prior to multiplying with Z i+1 . After multiplication with Z i+1 , the number of terms become 2N 1 + N 2 . Repeating this process, after multiplying j edge polynomials, we have only (j + 1)N 1 + N 2 terms. Notice that this is a dramatic improvement over the original algorithm which creates 2 j (N 1 + N 2 ) terms. The number of deferred terms is governed by the amount of available memory. That is we choose N 1 as large as possible while the maximum term count ((j+1)N 1 +N 2 ) remains less than available memory.

Results and discussion
In this section, we experimentally evaluate the performance of ProMotE on synthetically generated ("Evaluation on synthetic networks section") and real networks ("Evaluation on cancer networks" to "Evaluation on neurodegenerative disease networks section"). We run our experiments on a Linux server which has AMD Opteron 24 core processors (up to 1.4GHZ) and 32GB memory .

Running time evaluation
The key computational challenge we aim to tackle in this paper is to scale to large network sizes while existing methods fail. Here, we evaluate how well we achieved this goal. We compare the running time of ProMotE to that of Sarkar et al. [24] as this is the most recent and most efficient algorithm in the literature to the best of our knowledge. As these two methods have the similar expected motif count, we do not report their motif count. We test both methods on synthetic networks. We generate random networks with growing number of nodes using three random network models, namely Erdős Rényi (ER) [25], Watts Strogatz (WS) [26] and Barabási-Albert (BA) [27] models. We synthetically assign a probability value generated from the uniform distribution in the interval (0,1) to each edge of networks. We set the average node degree to two. For each number of nodes, we repeat the experiments for 20 random networks. We measure the total running time for four motif patterns (see Fig. 7). These motifs are also used in the study by Sarkar et al. [24]. To avoid the potential outlier, for the 20 networks of each network size, we exclude the two networks with largest running time, and another two with the smallest running time. For the remaining networks, we report the average, minimum and maximum values. For each method, we report the running time on networks if the algorithm completes in less than 10,000 seconds. Figure 8 presents the results.
We observe that our method is several orders of magnitude faster than Sarkar et al. for all parameter settings. As the network size increases, the gap between the running times of two methods grows. This indicates that our optimization strategies are greatly helpful on large networks. The advantage of ProMotE stands out the most for the networks generated using ER and BA models. For instance, on networks generated using ER model (see Fig. 8a), ProMotE successfully scales to massive network sizes (i.e., thousands of nodes). On the other hand, Sarkar et al. fails to run beyond 100 nodes. This implies that ProMotE is not only practical but it is also essential to study real networks, as the size of real networks is often large. Furthermore, we observe that the topological characteristics of the network greatly affect the cost of counting independent motifs. For the same network size, we observe a dramatic increase in running time as we move from WS model to ER model and then to BA model. ER model generates networks with a small average shortest path length along with a small clustering coefficient. WS model however builds networks with both short average paths and strong clustering coefficients. Thus, networks generated by ER model are more likely to have isolated nodes, which in turn make the remaining connected components much denser than those in networks generated by WS model. Dense networks however result in more computation. As a result, ProMotE runs much faster on networks of WS model. For the BA model, it constructs the so called scale-free networks characterized by a highly heterogeneous degree distribution, which follows power law. Thus, there exist more hub nodes in networks of BA model, which result in more overlapping motif embeddings. As a result, the overlap graphs of networks of BA model are much more complicated, which in turn introduce a large quantity of computation. Recall that the BA model generates scale free networks. Thus it is expected to generate networks that resemble real data better than the other two models. This indicates that counting independent motifs in real networks is a challenging problem, and ProMotE is needed to study them in practical time.

Comparison against the literature
In the literature, current approaches to the probabilistic networks often transform probabilistic networks to deterministic networks first, and then apply methods for deterministic networks. These approaches include ignoring probability values [28,29], considering edges with probability values above a given threshold [30], and sampling the probabilistic network by doing a Bernoulli trial with probability p i for each edge e i [31]. For simplicity, we denote these three approaches with binary, threshold and sampling method respectively. For each of these method, after transforming probabilistic networks to deterministic networks, we apply the method in the literature [19] to find the motif count, which heuristically selects motifs with the least number of overlapping embeddings. As these methods are not specifically devised to solve the problem in this paper and they finally work on deterministic networks, the performance of ProMotE in terms of running time is significantly worse than these methods. As a result, we compare our method against these methods only in terms of expected motif count.
Recall that the threshold method maintains the set of edges with probability value above a given threshold value and removes the remaining edges. Thus, the outcome of this method depends on the threshold value. Here, we first evaluate the performance of the threshold method for varying values of threshold. We run our experiment on synthetic networks with WS model of various network sizes, 1000, 5000 and 10000. In particular, for each network size, we repeat the experiments for 20 random networks. We vary the threshold value from 0 to 0.7 at increments of 0.1. For each threshold value setting, we run experiment on all generated networks and report the average motif count. Figure 9 plots the results.
We observe that the motif count of the threshold method first grows with the increasing threshold value. It then falls sharply. It obtains the peak value when the threshold is 0.2. This is possibly because too small/large threshold leads to the retaining/removing most of the edges of the network. Either case may make the threshold method underutilize the information available in the interaction probabilities. As a result, a suitable threshold value is necessary for the threshold method. However, the varying distribution of edge probabilities of different probabilistic networks makes it is difficult to set a fixed threshold value for the threshold method. In the rest of our experiments, we fix the threshold value of the threshold method to 0.2 as it obtains the best value on the average across a broad spectrum of parameter settings.
Next, we compare our method with binary, threshold and sampling methods. We test these methods on synthetic networks with different network models of various Fig. 9 The expected motif count of our method and threshold method on synthetic networks network sizes. To ensure the reliability of our results, for each parameter, we conduct experiments on 20 different networks and report the average. Specifically, for sampling method, as each running on the same network may have the different value, to get a reliable result, we run 10 times on each network and report the average. Figure 10 reports the result. Our results demonstrate that our method has the highest motif count for all network sizes with various network models. Moreover, the gap between ProMotE and other methods increases with the growing number of nodes. This is very promising since we expect to have more number of nodes in real networks. For instance, on the network with WS network model of size 10,000, Pro-MotE has 13.2 to 95.4% more motif counts than other methods. The threshold method achieves the second best motif count. That said, it is worth noting that we give a positive bias towards the threshold method since we fix the threshold to the value which maximized its motif count in our previous experiments. Sampling method has overall worst performance across different network sizes. The reason behind is that finding a set of independent embeddings that yields the maximum expected number of occurrences in a probabilistic network is a nonlinear function. Although, the sampling approach can provide provable confidence intervals for estimating linear functions such as sum and average, it fails to do that for nonlinear functions such as counting independent motifs in probabilistic networks. Due to the nonlinear nature of our problem, a sampling approach is expected to produce inaccurate results. Furthermore, we observe that the expected motif count of all methods grows with the increasing network size. This is because the network with larger size is expected to have more motifs.
In summary, our method is very efficient and accurate on counting independent motif in probabilistic networks.

Evaluation on cancer networks
In order to observe the performance of our method on real networks, we apply our algorithm on six cancer datasets from the MSigDB database [32] for Homo Sapiens. We extract genes from the C2:curated gene sets of MSigDB. We then feed each cancer gene set into the STRING database [16] to generate its interaction network. We use the gene co-expression values present in the STRING dataset for these networks to compute their interaction  probabilities. Specifically, for a pair of nodes, the interaction probability between them is computed as the Pearson's correlation coefficient. In the literature, there are also other ways to compute interaction probabilities. For example, Sharan et al. [33] addressed this problem by utilizing features like the volume of evidence present for the interaction, gene expression correlation, and network topology to learn the edge probabilities. Gabr et al. [34] used end-to-end signal reachability probabilities between pairs of genes to guide the computation of the edge probabilities. All these studies use transcriptional values in their computations in slightly different ways. Table 2 presents the dataset details for each cancer type. We measure independent motif counts for the four basic patterns listed in Fig. 7. Figure 11a presents the running time of ProMotE. In addition, we also plot the average degree of each network to display the effect of running time on network size. Our results demonstrate that ProMotE successfully identifies independent motifs in practical time (1.5 secs to less than 2.2 h) for all networks. It is worth noting that Sarkar et al.'s method does not scale to these network sizes. Figure 11b shows the number of non-overlapping instances of the four basic patterns present in each of the cancer networks. We observe that the pattern M1 is the most abundant topology in all networks. This is expected as M1 is a subgraph of M2, M3, and M4. The other three motifs have similar counts. We also observe that the motif count does not necessarily grow with the network size. For instance Thyroid cancer has the least number of nodes and edges, yet it contains more non-overlapping motif instances than almost all the other networks in our dataset. This implies that the topology of the network governs the motif distribution. Recall that the running time of ProMotE tends to grow with network size on the cancer networks (see Fig. 11a). This implies that the running time of ProMotE does not strongly depend on the independent motif count as well.

Evaluation on neurodegenerative disease networks
Next, we evaluate ProMotE on three neurodegenerative disease networks; Alzheimer's, Huntington's and Parkinson's disease, hereafter referred to as AD, HD and PD respectively. We obtain this dataset from the MSigDB database similar to the cancer networks (see "Evaluation on cancer networks section"). Table 3 lists the dataset details. One major difference between this and the cancer dataset is that the neurodegenerative networks are both larger in size and average degree. It is worth noting that the AD network is a subgraph of that of HD network. Also, all three networks share substantial amount of edges (details later in this section). We focus on motif M2 (the loop pattern) in this experiment. Table 4 shows the results. We observed that AD and HD networks yield the same number of embeddings and F 2 count. Totally, 37 to 56% of the genes participate in at least one motif embedding. PD network has the largest motif count and fraction of genes in motif embeddings. Such large motif count indicates that these networks are organized largely as a combination of small loops. Finally, our method completed counting motifs in slightly over half an hour per network. This demonstrates that our method scales up to very large network sizes and densities.  Next, we take a closer look at the distribution of the shared edges, which appear in a motif embedding, across different diseases. Figure 12 presents the distribution. We observe that the edges present in AD and HD are exactly the same. PD network however deviates from the other two by about 6%. This implies that all three diseases possibly are governed by very similar processes, with PD having slight variations.
Next, we focus on the statistical significance of the biological processes and molecular functions of the genes that comprise the result set for the three neurodegenerative diseases for M2 pattern. To do this, we perform gene ontology analysis using PANTHER [35]. For each network, we use all the genes appearing in an embedding of M2. We filter the Gene Ontology (GO) terms with p-value above 0.025. We then search each of the remaining GO term in PubMed with the disease corresponding to that network (i.e., AD, HD, or PD). Table 5 shows the results for which at least one reference publication exists.
Our results demonstrate that ProMotE identifies disease specific functional properties. For instance, mitochondria often referred to as the power house of the cell is responsible for many important cellular functions especially neuronal viability. Therefore, aberrations in mitochondrial processes have potential to lead to neuronal disorder. Four distinct pathways carry out mental processing for brain metabolism, two of them being tricarboxylic acid cycle (TCA) and electron transport chain (the other two are glycolysis, pentose shunt). These pathways are affected by direct modification of the enzymes or alterations at the gene expression level. TCA is responsible for producing reducing equivalents in the form of reduced nicotinamide adenine dinucleotide (NADH) and reduced flavin adenine  . 12 The number of common edges present in the three disease networks for independent embeddings of the triangle motif pattern dinucleotide (FADH2). Altered activities of the TCA cycle enzymes result in imbalance which is associated with AD-related changes in metabolism. Another important function is the production of adenosine triphosphate (ATP) through the combined effects of TCA and the respiratory chain, also known as electron transport chain. The respiratory chain requires two electron carriers: ubiquinone/coenzyme Q and cytochrome c. Also, it consists of five protein complexes: NADH dehydrogenase-ubiquinone oxidoreductase (complex I), succinate dehydrogenase-ubiquinone oxidoreductase (complex II), ubiquinone-cytochrome c oxidoreductase (complex III), cytochrome c oxidase (complex IV), and ATP synthase (complex V). Production of ATP involves two coordinated mechanisms: electrons received from energy substrates such as NADH are transported through the mitochondrial complexes towards molecular oxygen, producing water; at the same time electrochemical gradient is generated by driving protons across the mitochondrial inner membrane by I, III, and IV protein complexes. Finally, ATP is produced by the accumulation of these protons into the matrix with the help of complex V (ATP synthase). Altered mitochondrial respiration, especially at the level of complex I thus associates with PD.

Conclusions
In this paper, we developed ProMotE, an efficient method to count non-overlapping motif instances in probabilistic networks. This method uses a polynomial model to capture the dependencies between overlapping embeddings. We proposed three strategies to avoid computation of loss value, to expedite collapsation of polynomial terms, and to overcome the memory bottleneck faced when applied to large networks. Our experiments on both synthetic and Table 5 Gene ontology analysis of the genes appearing in the independent embeddings of the triangle pattern in the three disease networks with publications, and the diseases associated with the ontology terms