A PPI network is generally represented as an undirected graph in which proteins are nodes and interactions are edges. Here, the static PPI networks are first converted into multi-relation reconstructed dynamic PPI networks. We then apply IFPA to add attachments to the appropriate cores, following the core-attachment structure.
Building multi-relation reconstructed dynamic PPI networks
The availability of gene expression data enables researchers to reveal the dynamics of molecular networks and improves the identification of protein complexes [22,23,24]. Hence, following the study [25], we integrate time-course gene expression data into the original static PPI network to generate dynamic PPI subnetworks that capture the dynamics of protein complexes. That is, we split the original static PPI network (OSPN) into twelve dynamic PPI subnetworks (DPSNs), such that all interactions in a DPSN can occur simultaneously, and then perform complex discovery on each DPSN.
First, we use the three-sigma method [25] to construct the dynamic PPI subnetworks from the time-series gene expression data. The gene expression data cover three metabolic cycles, and each cycle contains twelve timestamps. A protein v is considered active in a DPSN if its gene expression value at the corresponding timestamp is not less than its active threshold Active_Th(v):
$$ \mathit{Active\_Th}(v)=\mu(v)+3\sigma(v)\left(1-F(v)\right) $$
(1)
$$ F(v)=\frac{1}{1+\sigma^{2}(v)} $$
(2)
where μ(v) is the arithmetic mean of the gene expression values of v over timestamps 1 to n and σ(v) is their standard deviation. The three-sigma method is applied to each protein to calculate its active threshold Active_Th(v). The original PPI network can be described as an undirected graph G(V, E), where V is the set of nodes (proteins) and E is the set of edges (their interactions). The dynamic PPI network at timestamp t (t = 1, 2, …, n) is represented as Gt(Vt, Et). At a given timestamp, if two proteins vi and vj are both active and interact with each other in the original static PPI network, then there is an edge between vi and vj in the corresponding DPSN. In this way, twelve dynamic PPI subnetworks are constructed from the original static PPI network.
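The construction above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the names `active_threshold` and `build_dpsns` are hypothetical, each protein is assumed to come with one expression value per timestamp (how the three cycles are combined is not specified here), and the population standard deviation is assumed for σ(v).

```python
from statistics import mean, pstdev


def active_threshold(values):
    """Eq. (1)-(2): mu + 3 * sigma * (1 - F), with F = 1 / (1 + sigma^2)."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    f = 1.0 / (1.0 + sigma ** 2)
    return mu + 3.0 * sigma * (1.0 - f)


def build_dpsns(static_edges, expression, n_timestamps):
    """Return one edge set per timestamp.

    static_edges: iterable of (u, v) pairs from the static PPI network.
    expression:   dict protein -> list with one expression value per timestamp.
    An edge is kept at timestamp t only if both endpoints are active at t.
    """
    thresholds = {p: active_threshold(vals) for p, vals in expression.items()}
    dpsns = []
    for t in range(n_timestamps):
        active = {p for p, vals in expression.items() if vals[t] >= thresholds[p]}
        dpsns.append({(u, v) for u, v in static_edges if u in active and v in active})
    return dpsns
```

Note that a protein with constant expression has σ(v) = 0, so its threshold equals its mean and it counts as active at every timestamp under the "not less than" rule.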
Moreover, integrating heterogeneous data sources into a single network can enhance its reliability, which suggests that assigning suitable weights to edges can strengthen the confidence of interactions. The weighting scheme is described below. Figure 1 illustrates an example of the construction of multi-relation reconstructed dynamic PPI networks.
Definition 1 (Co-essentiality) Essential proteins are indispensable for the survival of an organism, so the interaction between two essential proteins can also be regarded as necessary. Hence, the concept of essential proteins is extended to measure the essentiality shared by two proteins, and the resulting essentiality values are used as edge weights.
Before giving the concept of co-essentiality, we first define an essential edge. Given two proteins vi and vj, the edge between them is an essential edge if both vi and vj are essential proteins, an uncertain edge if exactly one of them is an essential protein, and a nonessential edge if neither of them is an essential protein. Only the essential edges are taken into account to reconstruct the networks here. Let eeij be the essential edge between vi and vj; the co-essentiality between these two proteins is then defined as follows.
$$ \mathit{co\text{-}essentiality}(i,j)=\frac{ESS_{ij}}{sum\left(ESS_{j}\right)} $$
(3)
where ESSij denotes the weight of the essential edge eeij, which equals one, and sum(ESSj) denotes the sum of the weights in column j.
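One possible reading of Eq. (3) is sketched below. It assumes that ESS is the adjacency matrix of essential edges, so that sum(ESSj) is the number of essential edges incident to vj; this interpretation and the function name `co_essentiality` are assumptions made for illustration only. Under this reading the score is not symmetric in i and j, so both directions are returned.

```python
def co_essentiality(edges, essential):
    """Return {(i, j): ESS_ij / sum(ESS_j)} for every essential edge, in both directions.

    edges:     iterable of (u, v) pairs.
    essential: set of essential proteins.
    Assumption: sum(ESS_j) is the number of essential edges incident to j.
    """
    nbrs = {}  # essential-edge neighbours of each protein (non-zero entries of its column)
    for u, v in edges:
        if u in essential and v in essential:
            nbrs.setdefault(u, set()).add(v)
            nbrs.setdefault(v, set()).add(u)
    return {(i, j): 1.0 / len(nbrs[j]) for j in nbrs for i in nbrs[j]}
```

For example, with essential proteins A, B and C and edges A-B and A-C, the pair (B, A) scores 1/2 because column A contains two essential edges, while (A, B) scores 1.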
Definition 2 (Co-localization) Given two interacting proteins vi and vj, the interaction between them is more reliable if vi and vj are found in the same subcellular location. Their co-localization is defined by the following equation.
$$ \mathit{co\text{-}localization}(i,j)=\frac{{\left|SCL_{i}\cap SCL_{j}\right|}^{2}}{\left|SCL_{i}\right|\cdot \left|SCL_{j}\right|} $$
(4)
where |SCLi| and |SCLj| are the numbers of subcellular locations of proteins vi and vj, respectively.
Definition 3 (Co-annotation) Given two interacting proteins vi and vj, they have similar functions if they share some common GO annotations. Their co-annotation is calculated as follows.
$$ \mathit{co\text{-}annotation}(i,j)=\frac{{\left|GO_{i}\cap GO_{j}\right|}^{2}}{\left|GO_{i}\right|\cdot \left|GO_{j}\right|} $$
(5)
where |GOi| and |GOj| are the numbers of GO annotations of proteins vi and vj, respectively.
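Equations (4) and (5) share the same form, so a single helper can illustrate both. The sketch below is a minimal example; the function name `overlap_score` is hypothetical, and returning 0 when a protein has no annotation is an assumption, since that case is not covered by the equations.

```python
def overlap_score(terms_i, terms_j):
    """Eq. (4)/(5): |A ∩ B|^2 / (|A| * |B|) for two annotation sets."""
    a, b = set(terms_i), set(terms_j)
    if not a or not b:
        return 0.0  # assumption: proteins without annotations contribute no evidence
    return len(a & b) ** 2 / (len(a) * len(b))
```

Passing subcellular locations gives the co-localization score, and passing GO terms gives the co-annotation score. Two proteins with identical annotation sets score 1, and two proteins with disjoint sets score 0.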
$$ \mathit{co\text{-}cluster}(i,j)=\frac{Z_{ij}}{\min \left\{\left|N_{i}\right|-1,\left|N_{j}\right|-1\right\}} $$
(6)
where co-cluster(i, j) is the fourth relation used to weight an edge, Zij is the number of triangles built on the edge (vi, vj), and |Ni| and |Nj| are the degrees of proteins vi and vj, respectively.
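As a minimal sketch of Eq. (6), the number of triangles on an edge equals the number of neighbors shared by its two endpoints. The function name `co_cluster` and the adjacency-dictionary input are illustrative choices, and scoring 0 when an endpoint has degree one (zero denominator) is an assumption.

```python
def co_cluster(adj, i, j):
    """Eq. (6): triangles on edge (i, j) divided by min(deg(i) - 1, deg(j) - 1).

    adj: dict mapping each protein to the set of its neighbours.
    """
    z = len(adj[i] & adj[j])  # each common neighbour closes one triangle on (i, j)
    denom = min(len(adj[i]) - 1, len(adj[j]) - 1)
    return z / denom if denom > 0 else 0.0  # assumption: degree-1 endpoints score 0
```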
The multiple relations defined above are used to weight the networks. The multi-relation value between vi and vj is defined as follows.
$$ \mathit{multi\text{-}relation}(i,j)=\mathit{co\text{-}essentiality}(i,j)+\mathit{co\text{-}localization}(i,j)+\mathit{co\text{-}annotation}(i,j)+\mathit{co\text{-}cluster}(i,j) $$
(7)
These multi-relation values are used as the edge weights W(i,j) to increase the credibility of interactions. For an edge, the normalized weight NW(i,j) is given by the following formula.
$$ NW(i,j)=\frac{\mathit{multi\text{-}relation}(i,j)}{num\left(\mathit{multi\text{-}relation}\right)} $$
(8)
where num(multi-relation) is the total number of network relations, i.e., four (co-essentiality, co-localization, co-annotation and co-cluster); the networks are reconstructed by combining them. In this way, the dynamic PPI subnetworks (DPSNs) are converted into the multi-relation reconstructed dynamic PPI networks (MRDPNs).
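Equations (7) and (8) amount to averaging the four relation scores of each edge. The sketch below assumes that each relation is available as a dictionary from edge to score and that a missing entry counts as 0; both the data layout and the name `normalized_weights` are hypothetical.

```python
def normalized_weights(edges, relations):
    """Eq. (7)-(8): sum the relation scores of each edge, then divide by the number of relations.

    edges:     iterable of edge keys, e.g. (u, v) tuples.
    relations: list of dicts mapping edge key -> score (missing edges count as 0).
    """
    return {e: sum(r.get(e, 0.0) for r in relations) / len(relations) for e in edges}
```

For instance, an edge with scores 1.0, 0.5, 0.25 and 0.25 has a multi-relation value of 2.0 and a normalized weight of 0.5.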
Finding cores
A protein complex core is expected to be a densely connected subgraph of the PPI network. Thus, we pick seed proteins in the first stage and extend the seed proteins into cores in the second stage.
Definition 5 (Weighted Degree) The proteins whose weighted degree is greater than the average weighted degree are sorted in descending order to form the candidate core set CC. The weighted degree of a protein i in the MRDPN aggregates the interactions in which this protein is involved, and is expressed as follows.
$$ \mathit{Weighted\ Degree}(i)=\sum_{j}\mathit{interactions}(i,j) $$
(9)
The first node in the candidate core set CC is taken as a seed protein, which plays an irreplaceable role in the protein complex. A neighbor of the seed protein is inserted into the core set when the density of the resulting core set is greater than a given threshold DT. The threshold DT will be discussed in the next section.
$$ Density(CS)=\frac{2\times {\sum}_{\left(i,j\right)} NW\left(i,j\right)}{\left| CS\right|\cdot \left(\left| CS\right|-1\right)} $$
(10)
where |CS| denotes the number of nodes in the core set and the sum runs over the edges (i, j) inside CS. Initially, the core set CS contains only the seed protein. A neighbor of the seed protein is added to the core set if adding it makes Density(CS) greater than the threshold DT. This step is repeated until all neighbors of the seed protein have been examined, at which point the predicted core is complete. Once a complex core is completed, all nodes in it are labeled with “1” and cannot be included in any other complex core. The whole process stops when CC is empty.
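The seed-and-extend procedure can be sketched as follows. Several details are assumptions made only for illustration: the weighted degree is taken as the sum of the normalized weights NW of a protein's edges, neighbors are visited in descending order of their edge weight to the seed, seeds that attract no neighbor are discarded, and the names `density` and `find_cores` are hypothetical.

```python
def density(core, w):
    """Eq. (10): twice the total edge weight inside the core over |CS| * (|CS| - 1)."""
    nodes = sorted(core)
    total = sum(w.get(frozenset((u, v)), 0.0)
                for k, u in enumerate(nodes) for v in nodes[k + 1:])
    return 2.0 * total / (len(nodes) * (len(nodes) - 1))


def find_cores(w, dt):
    """Greedy core detection on a weighted network.

    w:  dict mapping frozenset({u, v}) -> normalized weight NW (no self-loops).
    dt: density threshold DT.
    """
    adj = {}
    for edge in w:
        u, v = tuple(edge)
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    # Assumption: weighted degree = sum of NW over the edges of a protein.
    wdeg = {u: sum(w[frozenset((u, v))] for v in adj[u]) for u in adj}
    avg = sum(wdeg.values()) / len(wdeg)
    # Candidate core set CC: above-average proteins in descending weighted degree.
    candidates = sorted((u for u in wdeg if wdeg[u] > avg), key=lambda u: (-wdeg[u], u))
    labelled, cores = set(), []
    for seed in candidates:
        if seed in labelled:  # already part of an earlier core
            continue
        core = {seed}
        # Assumption: strongest neighbours of the seed are tried first.
        order = sorted(adj[seed] - labelled, key=lambda v: (-w[frozenset((seed, v))], v))
        for nbr in order:
            if density(core | {nbr}, w) > dt:
                core.add(nbr)
        if len(core) > 1:  # assumption: a seed that attracts no neighbour is dropped
            labelled |= core
            cores.append(core)
    return cores
```

On a triangle A, B, C with weights 0.9 that is loosely attached to a chain C-D-E with weights 0.1, a threshold of DT = 0.5 yields the single core {A, B, C}, because adding D would lower the density to about 0.47.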
Finding peripheries
While the core plays a central role, the periphery plays a supporting role. The key idea behind our IFPA algorithm is to use the pollination mechanism to mimic the process of pollen falling on suitable flowers, which differs from other general methods. In this subsection, we first give a brief introduction to the flower pollination algorithm (FPA) [19], and then describe how we improve it to find the optimal cores for peripheries.
FPA is a nature-inspired optimization algorithm that comprises two main modes: global pollination and local pollination. Global pollination can be represented as:
$$ x_{i}^{t+1}=x_{i}^{t}+L\left(x_{i}^{t}-G\right) $$
(11)
where \( {x}_i^t \) is pollen i at iteration t and G is the current best solution. The parameter L is the strength of the pollination, that is, a step size. We use a Lévy flight to model insects moving over long distances with steps of various lengths. That is, L is greater than 0 and obeys the Lévy distribution:
$$ L\sim \frac{\lambda \Gamma(\lambda)\sin \left(\pi \lambda /2\right)}{\pi}\frac{1}{s^{1+\lambda}},\quad \left(s\gg s_{0}>0\right) $$
(12)
where Γ(λ) is the standard gamma function. The local pollination can be defined as:
$$ x_{i}^{t+1}=x_{i}^{t}+\Psi \left(x_{j}^{t}-x_{k}^{t}\right) $$
(13)
where \( {x}_j^t \) and \( {x}_k^t \) are pollen from different flowers of the same plant species. This essentially simulates flower constancy in a limited neighborhood. Mathematically, if \( {x}_j^t \) and \( {x}_k^t \) come from the same species or are selected from the same population, this update is a local random walk when Ψ is drawn from the uniform distribution on [0, 1].
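The two update rules of the basic FPA, Eqs. (11) and (13), can be sketched as below. The sampler for the Lévy step is an assumption: Mantegna's algorithm is a common way to draw such steps, but the text does not state which sampler is used. The absolute value is taken so that L > 0, λ = 1.5 is an assumed default, and all function names are hypothetical.

```python
import math
import random


def levy_step(lam, rng):
    """Draw one positive heavy-tailed step with Mantegna's algorithm (assumed sampler)."""
    sigma_u = (math.gamma(1 + lam) * math.sin(math.pi * lam / 2)
               / (math.gamma((1 + lam) / 2) * lam * 2 ** ((lam - 1) / 2))) ** (1 / lam)
    u = rng.gauss(0.0, sigma_u)
    v = rng.gauss(0.0, 1.0)
    return abs(u) / abs(v) ** (1 / lam)  # abs(u) keeps the step positive, as L > 0


def global_pollination(x, best, rng, lam=1.5):
    """Eq. (11): x + L * (x - G), with one scalar Levy step L per update."""
    step = levy_step(lam, rng)
    return [xi + step * (xi - gi) for xi, gi in zip(x, best)]


def local_pollination(x, xj, xk, rng):
    """Eq. (13): x + psi * (xj - xk), with psi drawn uniformly from [0, 1)."""
    psi = rng.random()
    return [xi + psi * (a - b) for xi, a, b in zip(x, xj, xk)]
```

A pollen that already coincides with the current best solution G is left unchanged by the global step, since the difference term vanishes.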
We then use IFPA, an improved version of the FPA algorithm, to find the closest cores for peripheries, which is equivalent to finding the most suitable flowering plants for pollen. The workflow of the IFPA algorithm is shown in Fig. 2. Proteins not included in any core set are regarded as candidate pollen. In IFPA, pollen corresponds to attachments and pollination plants correspond to cores. The position of a pollen is the sequence number of a core. The pollen position is updated as follows.
$$ S_{i,j}^{t+1}=\begin{cases} S_{i,j}^{t}, & \text{if } \mathit{Pollination\ Priority}_{i,j}>Thr \\ randperm(Num,d), & \text{otherwise} \end{cases} $$
(14)
where Thr denotes a threshold, set to 0.2 here. The function randperm returns an integer from one to Num, that is, a new core sequence number; Num is the number of cores and d is one.
Definition 7 (Pollination Priority) As part of an entire protein complex, the attachments maintain a relatively close relationship with the core; we call this relationship the pollination priority. The pollination priority of a pollen with respect to a core set CS is defined as follows.
$$ \mathit{Pollination\ Priority}(pollen,CS)=\sum_{u\in CS}\mathit{co\text{-}cluster}(pollen,u) $$
(15)
where u is a protein in the core set CS. The pollination priority depends on the affinity between the pollen and the flower: the closer their relationship, the higher the priority with which the pollen pollinates this flower. In the update procedure, if the pollen finds a flower that improves its pollination priority, the pollen falls on this flower; otherwise, it looks for a new flower to pollinate.
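A single position update following Eqs. (14) and (15) might look like the sketch below. It is an illustration under assumptions: cores are indexed from 0 rather than 1, `co_cluster` is any callable returning the co-cluster score of two proteins, and the function names are hypothetical. The number of iterations and the stopping rule are not shown.

```python
import random


def pollination_priority(pollen, core, co_cluster):
    """Eq. (15): sum of co-cluster scores between the pollen and every core protein."""
    return sum(co_cluster(pollen, u) for u in core)


def update_position(pollen, position, cores, co_cluster, rng, thr=0.2):
    """Eq. (14): stay on the current core if its priority exceeds thr, else pick a core at random.

    position: index of the core the pollen currently sits on (0-based here).
    """
    if pollination_priority(pollen, cores[position], co_cluster) > thr:
        return position
    return rng.randrange(len(cores))  # plays the role of randperm(Num, 1)
```

For example, a candidate attachment P whose co-cluster scores with the two proteins of a core are 0.3 and 0.1 has a pollination priority of 0.4, which exceeds Thr = 0.2, so P stays on that core.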
Finally, we merge all the candidate protein complexes mined in the twelve subnetworks and filter out highly overlapping complexes to obtain our final predicted protein complexes. Algorithm 1 outlines the implementation of our IFPA method.