### Functional summarization problem

In this section, we formally introduce the functional summarization problem. We begin by defining some terminology that we shall be using in the sequel.

A protein interaction network (PPI) *G* = (*V*, *E*) contains a set of vertices *V*, representing proteins, and a set of edges *E*, representing interactions. Each edge has a positive real weight *ω* that represents its interaction strength. Given a GO directed acyclic graph (DAG), denoted as *D*, the ordered set Δ = 〈*a*_{1}, *a*_{2}, ..., *a*_{n}〉 is a topological sort of *D*, where *a*_{i} represents a single GO term. The *term association vector* of *v* ∈ *V*, denoted by Δ_{v}, is defined as Δ_{v} = 〈*a*_{1}(*v*), *a*_{2}(*v*), ..., *a*_{n}(*v*)〉, *a*_{i}(*v*) ∈ {0, 1}, such that *a*_{i}(*v*) = 1 if and only if the term *a*_{i} or one of its descendants is associated with protein *v*; otherwise, *a*_{i}(*v*) = 0. Note that Δ_{v} indicates the GO terms that are associated with *v*.
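The descendant propagation in this definition can be sketched in Python; the data structures (a `children` map for the DAG and a per-protein annotation set) are illustrative assumptions, not constructs from the paper:

```python
def term_association_vector(annotations, children, terms):
    """Build the 0/1 term association vector Delta_v for one protein.

    annotations : set of GO terms directly annotated to the protein
    children    : dict mapping each GO term to its child terms in the DAG
    terms       : GO terms in topological order (ancestors before descendants)
    """
    assoc = {}
    # Process terms in reverse topological order so every descendant's
    # bit is known before its ancestors are evaluated.
    for term in reversed(terms):
        direct = term in annotations
        via_descendant = any(assoc.get(c, 0) for c in children.get(term, ()))
        assoc[term] = 1 if (direct or via_descendant) else 0
    return [assoc[t] for t in terms]
```

A term is switched on either by a direct annotation or by any annotated descendant, which matches the "term or its descendants" clause above.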

### Functional summary of PPI

Given a PPI *G*(*V*, *E*), a *functional summary graph* (FSG) is an undirected graph Θ_{G}(*S*, *F*) that models the set of higher-order *functional clusters* *S* and their interactions *F* that underlie the PPI. A *functional cluster* is a subgraph of *G* that shares a particular function/role based on the structure and attribute properties of the subgraph and its constituent proteins. Functional clusters may include complexes, processes, and signaling pathways. A pair of functional clusters may be connected by a web of protein interactions. If the number of interactions is significantly large, we say that the pair of clusters is *associated*. An FSG Θ_{G} thus captures the higher-order modules that comprise the PPI and their interconnections. We now define these concepts formally.

**Definition 1 (Functional Cluster)**
*Let V*(*a*_{i}) ⊆ *V* *denote the set of vertices in G such that v* ∈ *V*(*a*_{i}) *if and only if a*_{i}(*v*) = 1. *The* functional cluster *of a*_{i} ∈ Δ, *denoted by C*(*a*_{i}) ⊆ *G*, *is the subgraph of G that is induced by V*(*a*_{i}).

Note that *V*(*a*_{i}) represents the set of vertices of *G* that are associated with term *a*_{i} ∈ Δ. In this paper, we also treat *C*(*a*_{i}) as a vertex. We may also call a functional cluster a *functional subgraph* when we wish to emphasize the fact that it is a graph. Figure 3(b) shows a subset of the possible functional clusters of the PPI in Figure 3(a). Every node in a cluster must share a particular function or attribute. For instance, nodes in the functional cluster cytosol share the cytosol term.
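As a concrete illustration of Definition 1, a vertex-induced subgraph can be extracted in a single pass (the dict-of-edges representation is an assumption for illustration):

```python
def functional_cluster(vertices, edges, term_vertices):
    """Return the subgraph of G induced by the proteins carrying a term.

    vertices      : all proteins in the PPI
    edges         : dict mapping (u, v) pairs to interaction weights
    term_vertices : V(a_i), the proteins annotated with term a_i or a descendant
    """
    sub_v = set(term_vertices) & set(vertices)
    # Keep exactly the edges whose endpoints both lie in V(a_i).
    sub_e = {(u, v): w for (u, v), w in edges.items()
             if u in sub_v and v in sub_v}
    return sub_v, sub_e
```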

**Definition 2 (Functional Summary Graph (FSG))**
*A* functional summary graph *of the underlying protein interaction network G*(*V*, *E*), *denoted by* Θ_{G}, *is defined as* Θ_{G} = (*S*, *F*, *P*_{i}, *α*), *where S is a set of functional clusters and F is a set of edges that link the functional clusters. Let oc*_{uv} *be the number of interactions connecting proteins in C*(*u*) *and C*(*v*). *Let P*_{i} *be the probability of observing oc*_{uv} *or more interactions between C*(*u*) *and C*(*v*). *Let β be a user-defined significance cut-off parameter. Then* (*C*(*u*), *C*(*v*)) ∈ *F if and only if P*_{i}(*X* ≥ *oc*_{uv}) ≤ 2*β*/|*S*|^{2}. *The bijection α* : {1, 2, ..., *m*} ↔ *S is an ordering of S*.

Observe that the aforementioned definition of a functional summary includes additional constructs and rules for determining whether two functional clusters are associated. We elaborate on this further. Given a PPI *G*(*V*, *E*), the expected probability of observing an interaction between two randomly drawn proteins is given by *p*_{i} = |*E*| / (|*V*|(|*V*| − 1)/2). Let (*C*(*u*), *C*(*v*)) be a functional cluster pair such that members of both clusters were randomly drawn from *V*. If proteins *v*_{1} and *v*_{2} are randomly drawn from *C*(*u*) and *C*(*v*), respectively, then the expected probability of observing a positive interaction between them would also be *p*_{i}. Let *n* = |*C*(*u*)||*C*(*v*)|. Based on the independent and identically distributed (*iid*) assumption, we model the probability of observing *oc* (the number of interactions between *C*(*u*) and *C*(*v*)) as the probability of observing *oc* positive interactions after *n iid* trials, representing *n* pairwise interaction trials between proteins in *C*(*u*) and *C*(*v*). Hence, the probability of *oc* or more positive interactions between *C*(*u*) and *C*(*v*) can be modeled using a binomial distribution:

*P*_{i}(*X* ≥ *oc*) = Σ_{x = oc}^{n} C(*n*, *x*) *p*_{i}^{x}(1 − *p*_{i})^{n − x}

where C(*n*, *x*) is the binomial coefficient. This *p-value* is used to assess the *association significance* between a pair of functional clusters. Given a set containing *k* clusters, association significance between *k*(*k* − 1)/2 pairs of clusters would have to be tested. To this end, we applied Bonferroni correction to account for multiple testing. Given the *significance cut-off β*, a pair of functional clusters is *significantly associated* if *P*_{i}(*X* ≥ *oc*_{uv}) ≤ 2*β*/|*S*|^{2}.

Observe that although we have adopted a simple model to assess cluster-cluster association, the aforementioned definition is general enough to encompass more sophisticated association models.
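The binomial tail test above can be computed exactly with the standard library; `association_pvalue` and `significantly_associated` are illustrative names, and the cut-off mirrors the 2*β*/|*S*|^{2} rule of Definition 2:

```python
from math import comb

def association_pvalue(oc, n_u, n_v, p_i):
    """P(X >= oc) for X ~ Binomial(n, p_i), with n = |C(u)| * |C(v)| trials."""
    n = n_u * n_v
    return sum(comb(n, x) * p_i**x * (1 - p_i)**(n - x)
               for x in range(oc, n + 1))

def significantly_associated(oc, n_u, n_v, p_i, beta, k):
    """Bonferroni-style cut-off over the k*(k-1)/2 cluster pairs (2*beta/k^2)."""
    return association_pvalue(oc, n_u, n_v, p_i) <= 2 * beta / k**2
```

The exact summation is fine for the small *n* typical of cluster pairs; for large *n*, a binomial survival function from a statistics library would be the numerically safer choice.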

**Example 1** Figure 3(d) shows an FSG consisting of 5 functional clusters. An edge between two functional clusters exists when *P*_{i}(*X* ≥ *oc*_{uv}) ≤ 2*β*/|*S*|^{2}, implying that more edges connect proteins between the functional clusters than expected at random.

### Problem statement

The functional summarization problem is the problem of finding the Θ_{G} that best represents the underlying PPI subject to a *summary complexity constraint*. To model this problem, we propose a profit maximization model that aims to find Θ_{G} = (*S*, *F*, *P*_{i}, *α*) by maximizing information profit under a budget constraint. Every protein *i* ∈ *V* is assigned a non-negative *information budget* *b*, which represents the information it contains. Let *S*_{Δ} be the set of functional clusters induced from Δ. Every functional cluster *C*(*u*) ∈ *S*_{Δ} is assigned a non-negative *structural information value* *ψ*^{C(u)} (to be defined later), which represents the amount of structural information contained within the functional subgraph. When a functional cluster *C*(*u*) is added to the summary, for every protein *i* ∈ *V*(*u*), a portion of *b* is taken out and added to the summary information gain. This represents new information added to the summary. The amount taken depends on *ψ*^{C(u)}. Imposing the information budget *b* limits the amount of information a protein can provide. A parameter 0 ≤ *d* ≤ 10 is also introduced to penalize redundancy. By doing so, repeated representation of a protein *i* yields reduced information gain, modeling diminishing returns. Based on this profit model, we construct the set of functional clusters that maximizes profit while satisfying the constraints.

**Definition 3 (Functional Summarization Problem)**
*Let K*_{i} *be the set of functional clusters such that C*(*u*) ∈ *K*_{i} *if and only if i* ∈ *C*(*u*). *For every C*(*u*) ∈ *S*_{Δ}, *let ψ*^{C(u)} *be the structural information value of C*(*u*). *Given a protein interaction network G*(*V*, *E*) *and user-defined parameters b, d and k, the functional summarization problem constructs a k-cluster* FSG Θ_{G} = (*S*, *F*, *P*_{i}, *α*) *that satisfies the following optimization problem:*

We elaborate on how the *structural information value* *ψ*^{C(u)} is assigned. A functional cluster *C*(*u*) and its protein constituents share a common function *u*, and thus the vertices in the subgraph are considered homogeneous attribute-wise. However, this does not imply that the functional subgraph is structurally cohesive (dense). Proteins having the common function *u* may still be weakly interacting. This may be because *u* itself indicates a general function (e.g., 'protein binding') that is a common attribute of a large number of proteins that do not interact with each other. We argue that structurally cohesive functional clusters contain more information than those that are loosely interconnected. The argument is based on the MDL principle, whereby clusters that have higher than expected cohesiveness have higher information content because of the lower probability of observing a random cluster having the same cohesiveness. However, we make the following exception: a functional cluster with lower than expected cohesiveness is not deemed structurally informative.

Since the optimization problem must choose among a set of functional clusters, we are not concerned with the actual p-value of observing a subgraph having such interaction density. Instead, we only need a measure that allows us to compute the relative ranking of the functional clusters by their information content. This simplification leads to much greater computational efficiency. We define the *structural information value* of a functional cluster *C*(*u*) as follows.

**Definition 4 (Structural Information Value)**
*Let ω*_{ij} *be the edge weight of* (*i*, *j*) ∈ *E*. *The* structural information value *of a functional cluster C*(*u*), *denoted by ψ*^{C(u)}, *is defined as ψ*^{C(u)} = *ρ*^{C(u)}, *where ρ*^{C(u)} *is the ratio association score of C*(*u*), i.e., the total weight of the edges within *C*(*u*) divided by |*V*(*u*)|.
**Algorithm 1** Algorithm FUSE

**Input:**
*G*, Δ, *D*, *k*, *b*, *d*, *β*

**Output:** Θ_{min}

- 2: Let *B*_{map} = set of pairs (*i*, *b*) for each *i* ∈ *V*
- 3: Assign *ψ*^{C(u)} and *c*^{C(u)} for each *C*(*u*) ∈ *S*_{Δ}
- 6: (*C*_{min}, *B*_{map}) = **MapProfit**(*S*_{Δ}, *B*_{map}, *d*, |*V*|, *k*)
- 7: Remove *C*_{min} from *S*_{Δ}
- 12: **if** *C*(*i*) ≠ *C*(*j*) and *P*_{i}(*X* ≥ *oc*_{C(i)C(j)}) ≤ 2*β*/|*S*|^{2} **then**
- 13: Add edge (*C*(*i*), *C*(*j*)) to *F*

*ρ*^{C(u)} is the *ratio association* [35] score of *C*(*u*), a standard graph clustering objective we adopt that indicates the structural density of *C*(*u*). At first glance, it may seem that the structural information value should be defined as *ψ*^{C(u)} = *ρ*^{C(u)} − *ρ*^{random}, where *ρ*^{random} is the *expected structural density* of a random cluster. However, we ignore *ρ*^{random} for the following reason. In scale-free and Erdős-Rényi graphs, the self-information −log *P*(*ψ*^{C(u)}) is a positive non-decreasing function of *ψ*^{C(u)} for *ψ*^{C(u)} > 0. Hence, *ψ*^{C(u)} can be used to compare the self-information between two functional clusters without having to determine the probability density function of the interaction distribution of a subgraph. Given *a*_{i}, *a*_{j} ∈ Δ, *C*(*a*_{i}) is deemed *more informative* than *C*(*a*_{j}) if *ψ*^{C(ai)} > *ψ*^{C(aj)}. If both *ψ*^{C(aj)} and *ψ*^{C(ai)} are negative, it does not matter whether one is more informative than the other, since both have structural density less than that of random networks. As such, for symmetry, we also deem that *C*(*a*_{i}) is *more informative* than *C*(*a*_{j}) for *ψ*^{C(aj)} ≤ 0. Therefore, when comparing the structural density between two clusters, *ρ*^{random} can be omitted from *ψ*^{C(u)}, and *ψ*^{C(u)} is simply *ρ*^{C(u)}.
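Assuming the ratio association score of a single cluster is its total internal edge weight divided by its vertex count, *ρ*^{C(u)} can be computed in one pass over the edges (a sketch; the names are illustrative):

```python
def ratio_association(cluster_vertices, edges):
    """Ratio association: total within-cluster edge weight / cluster size.

    cluster_vertices : set of proteins in C(u)
    edges            : dict mapping (i, j) pairs to interaction weights w_ij
    """
    members = set(cluster_vertices)
    # Only edges with both endpoints inside the cluster contribute.
    internal = sum(w for (i, j), w in edges.items()
                   if i in members and j in members)
    return internal / len(members) if members else 0.0
```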

**Example 2** Suppose we wish to summarize the PPI in Figure 3(a) into a 3-node summary (*k* = 3). If the clusters apoptosis, receptors, and TGF-beta are chosen instead of the clusters in Figure 3(c), we can see that the profit obtained is suboptimal. The information budgets of proteins b and c are depleted due to redundancy, while the information budgets of proteins d, e, g, and i are untapped. In contrast, the functional summary in Figure 3(c) is relatively more optimal: not only does its set of clusters maximize profit through superior coverage and minimal redundancy, it also maximizes profit through higher structural information (e.g., the cluster transcription is structurally dense compared to apoptosis).

**Algorithm 2** The **MapProfit** procedure.

**Input:**
*S*_{Δ}, *B*_{map}, *d*, |*V*|, *k*

**Output:**
*C*_{min}, *B*_{map}

- 6: Let (*i*, *b*(*i*)) ∈ *B*_{temp} and *p*(*i*) = *b*(*i*) − *ψ*^{C(u)}
- 23: Let (*i*, *b*(*i*)) ∈ *B*_{map} and *p*(*i*) = (*d*/10)(*b*(*i*) − *ψ*^{C(u)})
- 25: *b*(*i*) = (*d*/10)(*b*(*i*) − *ψ*^{C(u)})
- 30: **return** (*C*_{min}, *B*_{map})

### The algorithm FUSE

The profit maximization problem is a variation of the *budgeted maximum coverage problem* [36], which is an NP-hard problem. To permit a tractable solution, let us first consider a straightforward greedy approach. The initial FSG is an empty graph. Given the input protein interaction network *G*, *ψ*^{C(u)} for each functional cluster *C*(*u*) ∈ *S*_{Δ} is computed. The algorithm then iteratively selects the functional cluster that leads to the greatest increase in the net profit of the summary. Each time a functional cluster *C*(*u*) is selected, the FSG and the budget information *b*(*i*) for every protein *i* ∈ *V*(*u*) are updated. Once *k* clusters have been selected, the algorithm terminates by generating the FSG.

A major weakness of the aforementioned approach is that it tends to be "overenthusiastic" in its selection of functional clusters during early iterations. Functional clusters that are too large or too small may be selected early, resulting in very poor cluster choices at later iterations due to the limited information budget and the summary size (*k*) constraint. Hence, our proposed algorithm adds a *complexity cost* to each chosen cluster. Given the graph size |*V*| and summary size *k*, the *expected cardinality* of a functional cluster in the summary is defined by *E*[|*C*|] = |*V*|/*k*. Then the *size deviation cost*, denoted as *c*^{C(u)}, is defined as the square of the deviation of |*V*(*u*)| from *E*[|*C*|]. That is, *c*^{C(u)} = (|*V*(*u*)| − *E*[|*C*|])^{2}. Observe that the greater the difference between |*V*(*u*)| and *E*[|*C*|], the less likely the cluster is to be part of a summary of *k*-granularity. Clusters whose size deviates too much from the expected cardinality are penalized and therefore less likely to be selected. This reduces the chance of having too little or too much information budget remaining during the later iterations of the greedy heuristic.
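The size deviation cost can be stated directly (a trivial sketch under the definitions above):

```python
def size_deviation_cost(cluster_size, n_vertices, k):
    """c^{C(u)}: squared deviation of the cluster size |V(u)| from the
    expected cardinality |V|/k of a cluster in a k-cluster summary."""
    expected = n_vertices / k
    return (cluster_size - expected) ** 2
```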

The aforementioned intuition is realized in FUSE as outlined in Algorithm 1. It consists of three phases, namely, the *initialization* phase, the *greedy iteration* phase, and the *summary graph construction* phase. In the initialization phase (Lines 1-3), *ψ*^{C(u)} and *c*^{C(u)} for each functional cluster *C*(*u*) ∈ *S*_{Δ} are computed. The greedy iteration phase (Lines 4-10) involves the iterative addition of functional clusters into *S* in a greedy manner as described above. The best candidate functional cluster for the current round (*C*_{min}) is determined through the subroutine **MapProfit** (Line 6). This step also maintains the information profit of *S* and the remaining information budget of every *v* in *G* through a persistent *profit map* (*B*_{map}). *C*_{min} is then removed from the candidate pool *S*_{Δ} and added to the solution set *S* (Lines 7-8). Finally, the summary graph construction phase (Lines 11-15) computes *F* to generate the FSG Θ_{min}.

The **MapProfit** procedure is outlined in Algorithm 2. To identify the best candidate cluster of the current iteration round, it evaluates the profit gain potential of every cluster in the candidate pool (Lines 1-21). First, the amount of information to extract from each protein's information budget pool (*b*(*i*)) is computed (Lines 7-13). Next, the potential profit gain is adjusted to compensate for the complexity cost (Lines 15-16). After *C*_{min} is found, the profit map is recomputed to commit the changes made to the information budget map due to the selection of *C*_{min} (Lines 21-29).
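The greedy loop of Algorithms 1 and 2 can be sketched as follows. This is a simplified model, not the paper's exact MapProfit: the per-cluster profit here is assumed to be the sum of min(*b*(*i*), *ψ*^{C(u)}) over member proteins minus the size deviation cost, and each selection damps its members' remaining budgets by *d*/10:

```python
def fuse_greedy(clusters, psi, cost, budgets, d, k):
    """Greedy skeleton of FUSE's selection loop (illustrative sketch).

    clusters : dict cluster id -> list of member proteins
    psi      : dict cluster id -> structural information value
    cost     : dict cluster id -> size deviation cost c^{C(u)}
    budgets  : dict protein    -> information budget b(i)
    """
    remaining = dict(budgets)
    summary = []
    pool = dict(clusters)
    for _ in range(min(k, len(pool))):
        best, best_profit = None, float("-inf")
        for cid, members in pool.items():
            # Each member contributes at most its remaining budget.
            gain = sum(min(remaining[i], psi[cid]) for i in members)
            profit = gain - cost[cid]  # penalize size deviation
            if profit > best_profit:
                best, best_profit = cid, profit
        summary.append(best)
        # Commit: damp covered proteins' budgets (diminishing returns).
        for i in pool[best]:
            remaining[i] = (d / 10) * max(remaining[i] - psi[best], 0.0)
        del pool[best]
    return summary
```

With *d* = 0 a protein contributes only once; with *d* close to 10, repeated coverage is penalized only mildly.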

**Theorem 1**
*Algorithm* FUSE *takes O*(|*S*_{Δ}|^{2}|*V*|^{2}) *time in the worst case*.

### Proof of Theorem 1

In the initialization phase, *ψ*^{C(u)} for each *C*(*u*) ∈ *S*_{Δ} has to be computed. Each *C*(*u*) may contain up to |*E*| edges and |*V*| vertices. In Algorithm 1, computing *ψ*^{C(u)} for each *C*(*u*) ∈ *S*_{Δ} takes *O*(|*E*|) time. Thus, the total complexity of this procedure is *O*(|*E*||*S*_{Δ}| + |*V*||*S*_{Δ}|) time.

In the greedy iteration phase, the **MapProfit** subroutine defined in Algorithm 2 is evaluated *k* times. In Algorithm 2, Lines 2-21 require *O*(|*S*_{Δ}||*V*|) time. Lines 22-29 require *O*(|*V*|) time. Thus, Algorithm 2 takes *O*(|*S*_{Δ}||*V*| + |*V*|) time. The iteration phase, as such, takes *O*(*k*|*S*_{Δ}||*V*| + *k*|*V*|) time in total.

Finally, the summary graph construction phase involves pairwise significance evaluation of the resultant functional cluster set. This involves the evaluation of all edges between the *k*-pairwise functional clusters of the summary. Each significance test *P*_{i}(*X* ≥ *oc*_{uv}) requires a single-pass evaluation of the edges connecting a pair of clusters, which takes *O*(|*E*|) time in the worst case. The summary graph construction phase therefore requires *O*(*k*^{2}|*E*|) time.

The FUSE algorithm, as a whole, takes *O*(|*E*||*S*_{Δ}| + |*V*||*S*_{Δ}| + *k*|*S*_{Δ}||*V*| + *k*|*V*| + *k*^{2}|*E*|) time. In the worst-case scenario of |*E*| = |*V*|^{2} and *k* = |*V*|, the algorithm takes *O*(|*S*_{Δ}||*V*| + |*S*_{Δ}||*V*|^{2} + |*V*|^{2} + |*V*|^{4}) time, implying polynomial time complexity in the worst possible case.

### Evaluation metrics

We used the *coverage* metric to evaluate the fraction of the annotated protein interaction network covered by a summary. A functional summary with high coverage is desirable because it is more representative of the underlying interaction network than a summary with low coverage. The coverage of a functional summary Θ is defined as:

coverage(Θ) = |⋃_{C(u) ∈ S} *V*(*u*)| / |*V*_{A}|

where *V*_{A} denotes the set of annotated proteins in *G*. That is, the coverage is the ratio of the number of annotated proteins in the summary over the total number of annotated proteins in the protein interaction network.

The *redundancy* metric is the average number of functional clusters each protein belongs to. This is an indicator of the amount of cluster overlap in the summary. The redundancy of Θ is defined as:

redundancy(Θ) = (Σ_{C(u) ∈ S} |*V*(*u*)|) / |⋃_{C(u) ∈ S} *V*(*u*)|

A summary Θ with no overlapping clusters has the lowest possible redundancy value of 1, where every protein is assigned to exactly one cluster. A summary with high redundancy is undesirable, because a summary with many highly overlapping clusters is less intuitive and more complicated.
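Both metrics reduce to a few set operations (a sketch; clusters are given as sets of protein identifiers):

```python
def coverage(summary_clusters, annotated_proteins):
    """Fraction of annotated proteins covered by the summary's clusters."""
    covered = set().union(*summary_clusters) & set(annotated_proteins)
    return len(covered) / len(annotated_proteins)

def redundancy(summary_clusters):
    """Average number of clusters each summarized protein belongs to."""
    covered = set().union(*summary_clusters)
    return sum(len(c) for c in summary_clusters) / len(covered)
```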

The following well-known evaluation metrics are also used: *precision* and *recall*. These are standard statistical measures of accuracy and completeness. Precision, a measure of exactness, is defined as TP/(TP + FP). Recall, a measure of completeness, is defined as TP/(TP + FN). If a cluster *C*(*i*) is assigned the function *i*, then any protein *p* ∈ *C*(*i*) that is not annotated with *i* or its descendants is deemed a false positive. If *p* ∈ *C*(*i*) is annotated with *i* or its descendants, it is a true positive. Likewise, a protein *p* ∈ *V* that is annotated with *i* but is not in *C*(*i*) is deemed a false negative. Here, proteins without annotations are not taken into consideration.
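Per-cluster precision and recall then follow directly (a sketch assuming only annotated proteins appear in the inputs):

```python
def cluster_precision_recall(cluster, term_proteins):
    """Precision and recall of a cluster labeled with a term.

    cluster       : proteins placed in C(i)
    term_proteins : annotated proteins carrying term i (or a descendant)
    """
    tp = len(set(cluster) & set(term_proteins))   # correctly included
    fp = len(set(cluster) - set(term_proteins))   # included but unannotated
    fn = len(set(term_proteins) - set(cluster))   # annotated but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```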