### Overview

A PPI network can be described as an undirected and unweighted graph, *G*=(*V,E*), where *V* and *E* represent the nodes (proteins) and edges (interactions) of the network. In our method, we first assign a weight to every edge according to its importance to the network and remove the edges with lower weights as noise. The steps for identifying overlapping modules are then performed. The main idea for identifying overlapping parts in OMIM is to find nodes that have comparatively positive effects on different modules. In addition, hubs are identified according to their connections with their neighbors [12].

### De-noising

In general, data in PPI networks are obtained from high-throughput protein-protein interaction experiments. The most frequently used detection methods are yeast two-hybrid, tandem affinity purification, mass spectrometry and protein chips. Although these high-throughput methods make experiments easy to run, they also introduce noise and leave the data incomplete [13–15].

The main idea in our de-noising step is to assign a weight to each edge of a PPI network to reflect the reliability of the corresponding interaction. In our study, we use a popular metric from graph theory, i.e., the clustering coefficient. The clustering coefficient is a measure that represents the interconnectivity in the neighborhood of a node [16]. The clustering coefficient of node *i* with degree *k*_{i} can be described as

$$CC_i = \frac{2 n_i}{k_i (k_i - 1)} \tag{1}$$

where *n*_{i} denotes the number of triangles that go through node *i*.
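As a small illustration of this definition, the sketch below computes the clustering coefficient from an adjacency dictionary. The function name, the dictionary representation and the toy graph are assumptions made for the example, and nodes with fewer than two neighbors are given a coefficient of zero by convention.

```python
from itertools import combinations

def clustering_coefficient(adj, i):
    """Return CC_i = 2 * n_i / (k_i * (k_i - 1)), or 0.0 when k_i < 2."""
    neighbors = adj[i]
    k = len(neighbors)
    if k < 2:
        return 0.0
    # n_i: triangles through i, i.e. pairs of neighbors that are themselves linked
    n = sum(1 for u, v in combinations(neighbors, 2) if v in adj[u])
    return 2.0 * n / (k * (k - 1))

# Toy graph (hypothetical): triangle a-b-c with a pendant node d attached to c.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
cc_values = {v: clustering_coefficient(adj, v) for v in adj}
```

In this toy graph, node *a* lies on one triangle and has two neighbors, so its coefficient is 1, whereas node *c* has three neighbors but only one connected pair among them, giving 1/3.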

The weight of the edge between nodes *i* and *j* is assigned according to the following equation:

$$SCC(i,j) = CC_i + CC_j - CC'_i - CC'_j \tag{2}$$

where *CC*' represents the clustering coefficient after the edge between *i* and *j* is removed. According to Asur et al. [16], if two nodes are not actually connected in the original network, then the *SCC*(*i,j*) value should be small or equal to zero. Here, we define a threshold *α* and remove edges whose weights are smaller than *α* as noise.
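The de-noising step can be sketched as follows. The sketch assumes that the edge weight takes the form *SCC*(*i,j*) = *CC*_{i} + *CC*_{j} − *CC*'_{i} − *CC*'_{j}, following the clustering-coefficient metric of Asur et al. [16]; the function names, the adjacency-dictionary representation and the threshold value used in the example are illustrative assumptions, not values taken from this study.

```python
from itertools import combinations

def cc(adj, i):
    """Clustering coefficient of node i (0.0 when the degree is below 2)."""
    nbrs = adj[i]
    k = len(nbrs)
    if k < 2:
        return 0.0
    n = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
    return 2.0 * n / (k * (k - 1))

def scc(adj, i, j):
    """Assumed weight: CC_i + CC_j - CC'_i - CC'_j, with CC' taken after removing (i, j)."""
    before = cc(adj, i) + cc(adj, j)
    pruned = {v: set(nb) for v, nb in adj.items()}  # copy, so adj is not modified
    pruned[i].discard(j)
    pruned[j].discard(i)
    return before - (cc(pruned, i) + cc(pruned, j))

def denoise(adj, alpha):
    """Return a new graph that keeps only edges whose weight is at least alpha."""
    kept = {v: set() for v in adj}
    for i in adj:
        for j in adj[i]:
            # weights are always computed on the original graph
            if scc(adj, i, j) >= alpha:
                kept[i].add(j)
                kept[j].add(i)
    return kept

# Toy graph (hypothetical): triangle a-b-c with a pendant node d attached to c.
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
clean = denoise(adj, alpha=0.0)  # alpha = 0.0 is an illustrative choice
```

Under this assumed weight, the pendant edge (*c*, *d*) receives a negative value, because removing it raises the clustering coefficient of *c*; it is therefore discarded, while the three triangle edges are kept.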

### Overlapping module identification method

#### Newman algorithm

Because OMIM is a variant of the Newman algorithm, we first introduce the Newman algorithm briefly. It is a hierarchical agglomerative method based on the idea of modularity [11]. Modularity is a measure of the quality of a particular division of a network, and a large value of modularity corresponds to a good network division [17]. If we let *e*_{rk} be the fraction of edges in the network that connect nodes in group *r* to those in group *k*, and let

$$a_r = \sum_k e_{rk} \tag{3}$$

then

$$Q = \sum_r \left( e_{rr} - a_r^2 \right) \tag{4}$$

where *Q* is a quality function representing modularity. The physical meaning of Eq. (4) is that modularity is equal to the fraction of edges that fall within modules, minus the expected value of the same quantity if edges fell at random without regard to the modular structure [11]. The Newman algorithm is a method for optimizing *Q* in order to discover the best modular structure.
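To make the definition of *Q* concrete, the sketch below evaluates it for a given partition of an undirected, unweighted edge list. It assumes the usual symmetric convention in which each edge contributes 1/(2*m*) to *e*_{rk} and to *e*_{kr}; the function name, the `membership` mapping and the toy graph are illustrative assumptions.

```python
def modularity(edges, membership):
    """Q = sum over modules r of (e_rr - a_r ** 2)."""
    m = len(edges)
    e_rr = {}  # fraction of edges with both ends inside module r
    a_r = {}   # fraction of edge ends attached to module r
    for u, v in edges:
        ru, rv = membership[u], membership[v]
        a_r[ru] = a_r.get(ru, 0.0) + 1.0 / (2 * m)
        a_r[rv] = a_r.get(rv, 0.0) + 1.0 / (2 * m)
        if ru == rv:
            e_rr[ru] = e_rr.get(ru, 0.0) + 1.0 / m
    return sum(e_rr.get(r, 0.0) - a * a for r, a in a_r.items())

# Toy graph (hypothetical): two triangles joined by the single edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_modules = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
one_module = {v: "A" for v in range(6)}
q_two = modularity(edges, two_modules)
q_one = modularity(edges, one_module)
```

For the two-triangle partition, each module holds 3 of the 7 edges and half of all edge ends, so *Q* = 2(3/7 − 1/4) = 5/14; placing every node in a single module gives *Q* = 0.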

The steps of the Newman algorithm can be summarized as follows.

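Step 1. Initialize each node in the input data as its own module, and define a matrix *e* and a vector *a* according to Eqs. (5) and (6):

$$e_{ij} = \begin{cases} \dfrac{1}{2m}, & \text{if nodes } i \text{ and } j \text{ are connected} \\ 0, & \text{otherwise} \end{cases} \tag{5}$$

$$a_i = \frac{k_i}{2m} \tag{6}$$

where *m* represents the total number of edges in the network and *k*_{i} is the degree of node *i*.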

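Step 2. Calculate the change of modularity Δ*Q* for each pair of modules *r* and *k* according to:

$$\Delta Q = e_{rk} + e_{kr} - 2 a_r a_k = 2 \left( e_{rk} - a_r a_k \right) \tag{7}$$

Merge the pair of modules with the maximum value of Δ*Q*. Update matrix *e* by adding together the rows and columns of the merged modules.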

Step 3. Repeat Step 2, until the entire network has become one big module.

From this description, the progress of the Newman algorithm can be represented as a dendrogram. Cutting the dendrogram at different levels yields different modular structures. Newman cuts at the maximum value of *Q* to obtain the best modular structure.
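The three steps above can be sketched in a few lines. This is a minimal illustration rather than an optimized implementation: it assumes a simple graph (no self-loops or duplicate edges), only considers pairs of modules that share at least one edge, and records the partition with the largest *Q* seen during merging. The function name and the data layout are assumptions made for the example.

```python
def newman_greedy(nodes, edges):
    """Greedy modularity agglomeration; returns (best_Q, best_partition)."""
    m = len(edges)
    # Step 1: every node starts as its own module.
    modules = {v: {v} for v in nodes}
    e = {v: {} for v in nodes}   # each edge adds 1/(2m) to e[u][v] and e[v][u]
    a = {v: 0.0 for v in nodes}  # a[r]: row sum of e
    for u, v in edges:
        e[u][v] = e[u].get(v, 0.0) + 1.0 / (2 * m)
        e[v][u] = e[v].get(u, 0.0) + 1.0 / (2 * m)
        a[u] += 1.0 / (2 * m)
        a[v] += 1.0 / (2 * m)
    q = -sum(x * x for x in a.values())  # e_rr = 0 at the start
    best_q, best = q, [set(s) for s in modules.values()]
    while True:
        # Step 2: dQ = 2 * (e_rk - a_r * a_k) for every connected pair.
        candidates = [(2 * (w - a[r] * a[k]), r, k)
                      for r in e for k, w in e[r].items() if r != k]
        if not candidates:
            break  # Step 3: no connected pair is left to merge
        dq, r, k = max(candidates, key=lambda t: t[0])
        # Merge k into r by adding the rows and columns of e.
        row_k = e.pop(k)
        e_rk = e[r].pop(k)
        row_k.pop(r)
        e[r][r] = e[r].get(r, 0.0) + row_k.pop(k, 0.0) + 2 * e_rk
        for x, w in row_k.items():
            e[r][x] = e[r].get(x, 0.0) + w
            e[x][r] = e[r][x]
            del e[x][k]
        a[r] += a.pop(k)
        modules[r] |= modules.pop(k)
        q += dq
        if q > best_q:  # remember the best cut of the dendrogram
            best_q, best = q, [set(s) for s in modules.values()]
    return best_q, best

# Toy graph (hypothetical): two triangles joined by the single edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
best_q, best = newman_greedy(range(6), edges)
```

On this toy graph the best cut separates the two triangles, with *Q* = 5/14. For a disconnected network the loop stops once each connected component has become a single module, since merging unconnected modules can never increase *Q*.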

#### Identifying overlapping parts

It should be noted that complexes in PPI networks are not static and a protein can be included in different modules. Therefore, identifying the overlapping parts between different modules is necessary. We first apply the Newman algorithm to the input data and then identify overlapping nodes according to their contribution to modularity. The detailed steps are as follows.

Step 1. Perform Newman algorithm. All nodes are clustered without overlapping parts.

Step 2. Define nodes whose neighbors belong to more than two modules as candidate nodes.

Step 3. Randomly select node *i* from the set of candidate nodes. Assume that *i* is in module *A* and one of its neighbors, *j*, is in module *B*. Copy *i* to *B* to obtain a new module *B*'. If Eq. (8) is satisfied, then *i* is an overlapping node.

$$Q_{B'} > Q_B \tag{8}$$

where *Q*_{B} and *Q*_{B'} are the modularities of *B* and *B'*.

Step 4. Repeat Steps 2 and 3 until all overlapping parts are identified.
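The copy-and-compare test in these steps can be sketched as below. Two points are assumptions of the sketch rather than statements of the method: the modularity of a single module is taken to be its term in Eq. (4), *e*_{rr} − *a*_{r}², and Eq. (8) is read as "copying *i* into *B* increases the modularity of *B*". The function names and the toy network are hypothetical, and candidates are visited in a fixed order instead of at random, which does not change the result here because every test is made against the original, non-overlapping modules.

```python
def module_q(members, adj, m):
    """Assumed per-module modularity: e_rr - a_r ** 2."""
    internal = sum(1 for u in members for v in adj[u] if v in members) / 2
    degree_sum = sum(len(adj[u]) for u in members)
    return internal / m - (degree_sum / (2 * m)) ** 2

def overlapping_nodes(adj, modules):
    """Map each overlapping node to the extra modules it is copied into."""
    m = sum(len(nb) for nb in adj.values()) // 2
    home = {v: name for name, members in modules.items() for v in members}
    overlaps = {}
    for i in adj:
        touched = {home[j] for j in adj[i]}
        if len(touched) <= 2:
            continue  # Step 2: neighbors must span more than two modules
        for b in touched - {home[i]}:
            q_b = module_q(modules[b], adj, m)
            q_b_prime = module_q(modules[b] | {i}, adj, m)  # copy i into B
            if q_b_prime > q_b:  # assumed reading of Eq. (8)
                overlaps.setdefault(i, set()).add(b)
    return overlaps

# Toy network (hypothetical): three triangles; x sits in A, is tied to all of B
# and to one node of C.
edge_list = [("a1", "a2"), ("a1", "x"), ("a2", "x"),
             ("b1", "b2"), ("b2", "b3"), ("b1", "b3"),
             ("c1", "c2"), ("c2", "c3"), ("c1", "c3"),
             ("x", "b1"), ("x", "b2"), ("x", "b3"), ("x", "c1")]
adj = {}
for u, v in edge_list:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)
modules = {"A": {"a1", "a2", "x"}, "B": {"b1", "b2", "b3"}, "C": {"c1", "c2", "c3"}}
overlaps = overlapping_nodes(adj, modules)
```

In this toy network, *x* is the only candidate node. Copying it into *B* adds three internal edges and raises the assumed modularity of *B*, whereas copying it into *C* adds only one internal edge and lowers that of *C*, so *x* is reported as overlapping between *A* and *B* only.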

### Discovering hubs

Jordan et al. first identified hubs when they studied protein evolution and referred to proteins with a large number of partners as hubs [18]. Han et al. divided hubs into two classes: party hubs and date hubs [19]. Party hubs interact with their partners at the same time, whereas date hubs bind their different partners at different times or at different locations. According to their study, in a network with a modular structure, date hubs organize the proteome, while party hubs function inside modules. We propose a computational method that makes these hubs easier to detect.

First, we define party hubs as those proteins that have the maximal nodal weight (*w*_{i}) in a module, i.e.,

$$\text{party hub}_r = \underset{i \in r}{\arg\max}\; w_i \tag{9}$$

where *party hub*_{r} means a party hub of module *r*.

Date hubs are defined as proteins that bind at least three modules. We use a variable *ACC*_{i} to denote the number of modules to which *i* is bound. *ACC*_{i} is computed as

$$ACC_i = \sum_{r=1}^{n_r} f(i) \tag{10}$$

where *n*_{r} is the total number of modules in the network and *f*(*i*) is defined as follows:

$$f(i) = \begin{cases} 1, & \text{if node } i \text{ is bound to module } r \\ 0, & \text{otherwise} \end{cases} \tag{11}$$
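Both hub definitions can be sketched as follows. The nodal weights *w*_{i} are passed in as given, since how they are computed is not specified here, and a node is assumed to be "bound" to a module when it has at least one neighbor in that module. The function names, the weights and the toy network are hypothetical, and the sketch checks every node of the network for date-hub status.

```python
def party_hubs(modules, w):
    """For each module r, return the member with the maximal nodal weight w_i."""
    return {r: max(members, key=lambda i: w[i]) for r, members in modules.items()}

def acc(i, adj, modules):
    """ACC_i: number of modules in which node i has at least one neighbor (assumed meaning of 'bound')."""
    return sum(1 for members in modules.values() if adj[i] & members)

def date_hubs(adj, modules, threshold=3):
    """Nodes bound to at least `threshold` modules."""
    return {i for i in adj if acc(i, adj, modules) >= threshold}

# Toy network (hypothetical): node h links three small modules together.
adj = {
    "h": {"a1", "b1", "c1"},
    "a1": {"h", "a2"}, "a2": {"a1"},
    "b1": {"h", "b2"}, "b2": {"b1"},
    "c1": {"h", "c2"}, "c2": {"c1"},
}
modules = {"A": {"h", "a1", "a2"}, "B": {"b1", "b2"}, "C": {"c1", "c2"}}
# Illustrative nodal weights; not derived from any formula in the text.
w = {"h": 3.0, "a1": 2.0, "a2": 1.0, "b1": 2.0, "b2": 1.0, "c1": 2.0, "c2": 1.0}

hubs_party = party_hubs(modules, w)
hubs_date = date_hubs(adj, modules)
```

Here *h* has neighbors in all three modules, so *ACC* = 3 and it qualifies as a date hub, while *b1* touches only two modules and does not.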

### Algorithm

**1. de-noising**

input: *G*=(*V,E*); *α*

for all nodes *i*(*i*∈*V*) in *G*

compute the clustering coefficient *CC*_{i}

end

for all edges (*i,j*)((*i,j*)∈*E*) in *G*

compute the weight *SCC*(*i,j*)

if *SCC*(*i,j*)<*α*

remove edge (*i,j*) as noise

end

end

a new graph *G*'=(*V*',*E'*) is obtained

**2. clustering**

input: *G*'=(*V*',*E'*); number of nodes *n*; number of edges *m*

compute degree *k* for all nodes and construct *e* and *a*

1. compute the increment of modularity Δ*Q* for all edges

2. while (there is more than one module)

merge the module pairs with the maximum Δ*Q*;

update *e* and *a*;

recalculate Δ*Q*;

end

3. sort the *Q* values from all iterations and choose the modular structure *M* corresponding to the largest *Q*.

4. for node *i* in *M*

if *i* belongs to module *A* and its neighbor (in *G*') *j* belongs to *B*

copy *i* to *B* and construct *B'*

if Eq. (8) is satisfied

*i* is an overlapping node between *A* and *B*

end

end

end

5. a new modular structure *M*' with overlapping parts is obtained.

**3. discovering hubs**

input: *M*'

for module *r* in *M*'

*party hub*_{r} = argmax *w*_{i}, *i*∈*r*

end

for each node *i* not in any module

if *ACC*_{i} ≥ 3

*i* is a date hub

end

end