Construction of dynamic probabilistic protein interaction networks for protein complex identification

Zhang, Yijia; Lin, Hongfei; Yang, Zhihao; Wang, Jian

doi:10.1186/s12859-016-1054-1

Research article
Open access
Published: 27 April 2016

BMC Bioinformatics volume 17, Article number: 186 (2016)

Abstract

Background

Recently, high-throughput experimental techniques have generated a large amount of protein-protein interaction (PPI) data, from which large, complex PPI networks can be constructed for numerous organisms. Systems biology attempts to understand cellular organization and function by analyzing these PPI networks. However, most studies still focus on static PPI networks, which neglect the dynamic information of PPIs.

Results

Gene expression data collected at different time points and under different conditions can reveal the dynamic information of proteins. In this study, we used an active probability-based method to distinguish the active level of proteins at different active time points. We constructed dynamic probabilistic protein networks (DPPN) to integrate the dynamic information of proteins into static PPI networks. Based on DPPN, we subsequently proposed a novel method to identify protein complexes, which could effectively exploit the topological structure as well as the dynamic information of DPPN. We used three different yeast PPI datasets and gene expression data to construct three DPPNs. When applied to the three DPPNs, the method accurately identified many well-characterized protein complexes.

Conclusion

The shift from static PPI networks to dynamic PPI networks is essential for accurately identifying protein complexes. This method can not only be applied to identify protein complexes, but also establishes a framework for integrating dynamic information into static networks for other applications, such as pathway analysis.

Background

Recent advances in high-throughput experimental techniques such as yeast two-hybrid and mass spectrometry have generated a large amount of protein-protein interaction (PPI) data [1, 2]. From these data, large and complex PPI networks have been constructed for numerous organisms, such as Saccharomyces cerevisiae. PPIs are of central importance for most biological processes, and thus PPI networks can provide a global picture of cellular mechanisms. A key task of systems biology is to reveal cellular organization and function by analyzing the PPI networks. Protein complexes are molecular aggregations of two or more proteins assembled by multiple PPIs, and they play critical roles in many biological processes. Most proteins are only functional after assembly into protein complexes. Accurate determination of protein complexes in large PPI networks is crucial for understanding the principles of cellular organization and function at the network level [3].

Over the past decade, great effort has been made to identify protein complexes in PPI networks. As protein complexes are groups of proteins that interact with each other, they generally correspond to dense subgraphs in PPI networks. Several computational methods based on graph theory or on finding dense regions have been proposed to identify protein complexes from PPI networks. The molecular complex detection (MCODE [4]) algorithm proposed by Bader and Hogue was one of the first reported computational methods based on graph theory. Markov Clustering (MCL) [5] can also be applied to identify protein complexes by simulating random walks in PPI networks; it manipulates the weighted or unweighted adjacency matrix with two operators called expansion and inflation. Qi et al. [6] proposed a supervised-learning framework to predict protein complexes, which can learn topological and biological features from known protein complexes. Adamcsek et al. [7] developed the CFinder tool to find functional modules in PPI networks, which uses the clique percolation method [8] to detect k-clique percolation clusters. Moschopoulos et al. proposed a clustering tool (GIBA) to detect protein complexes [9], which involves two phases. Firstly, GIBA uses a clustering algorithm such as MCL or RNSC to cluster the given PPI network. Then, GIBA filters the clustering results to generate the final complexes based on a combination method. Liu et al. [10] proposed a clustering method based on maximal cliques (CMC) to detect protein complexes. Based on core-attachment structural features [11], Wu et al. [12] developed the COACH algorithm, which identifies protein-complex cores and protein-complex attachments separately. Zaki et al. proposed the ProRank method, which uses a protein ranking algorithm to identify essential proteins in a PPI network and predicts complexes based on the essential proteins [13]. Chin et al. proposed a hub-attachment based method called HUNTER to detect functional modules and protein complexes from confidence-scored protein interactions [14]. Since proteins may have multiple functions, they may belong to more than one protein complex. Nepusz et al. [15] proposed the ClusterONE algorithm, which detects overlapping protein complexes in PPI networks.

High-throughput experimental PPI data always have a high incidence of both false positives and false negatives [3]. Since the computational methods are highly dependent on the quality of the PPI data, the performance of complex prediction models is clearly limited by the noise in the high-throughput PPI data. Some studies have integrated other biomedical resources to improve the performance of protein complex identification. For instance, Zhang et al. [16] proposed the COAN algorithm based on ontology augmentation networks constructed with high-throughput PPI and gene ontology (GO) annotation data, which can take into account the topological structure of the PPI network as well as similarities in GO annotations.

So far, most studies on protein complex identification have focused only on static PPI networks. However, cellular systems are highly dynamic and responsive to cues from the environment [17, 18]. The PPI network in a cell changes over time, across environments and across different stages of the cell cycle [19, 20]. PPIs can be classified into permanent and transient PPIs based on their lifetime. Permanent PPIs are usually stable and irreversible. In contrast, transient PPIs change interaction partners dynamically and have short lifetimes. A protein complex is a group of two or more polypeptide chains that are associated at the same time. One major problem of protein complex identification is that static PPI networks cannot provide temporal information and do not reflect the actual situation in a cell [21]. It is therefore very difficult to identify complexes accurately from static PPI networks.

To address this problem, the shift from static PPI networks to dynamic PPI networks is essential for protein complex identification and other similar applications. Gene expression data collected at different time points and under different conditions can reveal the dynamic information of proteins. Some studies have integrated gene expression data to reveal the dynamics of PPIs. For example, Lin et al. [22] revealed dynamic functional modules under conditions of dilated cardiomyopathy based on co-expression PPI networks. Taylor et al. [23] analyzed the human PPI networks and discovered two types of hub proteins: intermodular hubs and intramodular hubs. Zhang et al. [24] used the Pearson correlation coefficient to calculate the coexpression correlation of gene expression data and built coexpression protein networks at different time points. Recently, Hanna et al. proposed a framework termed DyCluster to detect complexes based on PPI networks and gene expression data [25]. Firstly, DyCluster uses biclustering techniques to model the dynamic aspect of PPI networks by incorporating gene expression data. Then, DyCluster applies complex-detection algorithms, such as ClusterONE [15] and CMC [10], to detect the complexes from the dynamic PPI networks.

Gene expression data inevitably contain background noise. Identifying the active time points of each protein from gene expression data is therefore crucial for constructing dynamic PPI networks. In this study, we proposed a novel method to calculate the active probability of proteins at different time points. Furthermore, we constructed dynamic probabilistic PPI networks (DPPN) to integrate gene expression data and PPI data based on attributed graph theory, and proposed a clustering method to identify protein complexes from DPPN. There are two key differences between our method and DyCluster. Firstly, the DPPN constructed by our method can effectively distinguish the active level of a protein at a time point, which benefits complex identification. Secondly, our method does not directly apply other complex-detection algorithms, but proposes a new clustering method tailored to the characteristics of DPPN. We demonstrated the utility of the method by applying it to three different yeast PPI datasets and gene expression data. Three DPPNs were constructed and many well-characterized protein complexes were accurately identified. In addition, the method was compared with current protein complex identification methods. The advantages of the method, potential applications and improvements are discussed.

Methods

Calculation of active probability for proteins

Since a protein has its active periods in the cell [17, 18], the protein and its interactions appear and disappear in the PPI networks of a living cell. Gene expression data can reflect the dynamic information of proteins as it varies across time points or conditions. In general, the expression level of a protein decreases after the protein has completed its function. Therefore, a protein is considered active at a time point when the expression of its gene is at a high level.

A simple idea is to use a single global threshold for identifying the active time points of each protein. If the expression value of a gene is higher than the global threshold at a time point, the gene is considered expressed at that time point. However, expression levels during the active period differ from gene to gene. Wang et al. [26] proposed a three-sigma method to identify the active time points of each protein in a cellular cycle. The standard deviation (SD) is a statistic that measures how data are dispersed around their average. Let X be a real random variable with normal distribution N(α, σ²), which describes, for each individual gene, the distribution of its expression values across time. For any k > 0, P{|X − α| < kσ} = 2Φ(k) − 1, where Φ(·) is the distribution function of the standard normal law. In particular, for k = 1, 2, 3 it follows that P{|X − α| < σ} = P{α − σ < X < α + σ} ≈ 0.6827, P{|X − α| < 2σ} ≈ 0.9545 and P{|X − α| < 3σ} ≈ 0.9973. Based on these empirical rules, Wang et al. [26] designed an active threshold for each gene from its own characteristic gene expression data, and constructed dynamic PPI networks. Then, they tested some complex prediction methods, such as MCL [5], on the dynamic PPI networks. In this paper, we proposed a novel method to construct DPPN based on the three-sigma method [26]. Compared with the three-sigma method [26], our method can effectively distinguish the active level of a protein at a time point. Furthermore, we also proposed a new clustering method to identify complexes that is tailored to the characteristics of DPPN.
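The three probabilities quoted above follow directly from the identity P{|X − α| < kσ} = 2Φ(k) − 1. As a quick numerical check (a minimal sketch; the function name `prob_within_k_sigma` is ours and not part of the original method), the identity can be evaluated with the error function, since 2Φ(k) − 1 = erf(k/√2):

```python
import math


def prob_within_k_sigma(k: float) -> float:
    """P{|X - alpha| < k * sigma} for a normally distributed X."""
    # For the standard normal distribution function Phi,
    # 2 * Phi(k) - 1 is equal to erf(k / sqrt(2)).
    return math.erf(k / math.sqrt(2.0))


for k in (1, 2, 3):
    print(k, round(prob_within_k_sigma(k), 4))
```

Rounded to four decimals, the three values agree with 0.6827, 0.9545 and 0.9973.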

In practice, gene expression data always include noise. Active proteins with low expression values are likely to be filtered out even when an active threshold is used for each gene. To deal with this problem, we calculate the active probability of each protein at different time points based on the three-sigma method. Gene expression data often contain expression profiles over n time points. Let G_i(p) be the gene expression value of gene p at time point i. Let α(p) and σ(p) be the arithmetic mean and SD of the gene expression data G(p), respectively.

$$ \alpha(p) = \frac{\sum_{i=1}^{n} G_i(p)}{n} $$

(1)

$$ \sigma(p) = \sqrt{\frac{\sum_{i=1}^{n} \left(G_i(p) - \alpha(p)\right)^2}{n - 1}} $$

(2)

Since different genes correspond to different expression curves, we calculate the active probability of a protein based on the arithmetic mean and SD of the corresponding gene. Firstly, the k-sigma (k = 1, 2, 3) threshold can be calculated based on the three-sigma method [20] as follows:

$$ \mathrm{Ge\_thresh}_k(p) = \alpha(p) + k \cdot \sigma(p) \cdot \left(1 - \frac{1}{1 + \sigma^2(p)}\right) $$

(3)
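Equations (1) to (3) can be sketched in a few lines of Python. This is an illustrative sketch only: the function names `mean_and_sd` and `ge_thresh`, and the representation of an expression profile as a plain list of floats, are our assumptions rather than anything specified in the paper.

```python
import math
from typing import Sequence, Tuple


def mean_and_sd(profile: Sequence[float]) -> Tuple[float, float]:
    """Arithmetic mean (Eq. 1) and SD with n - 1 in the denominator (Eq. 2)."""
    n = len(profile)
    if n < 2:
        raise ValueError("need at least two time points")
    alpha = sum(profile) / n
    variance = sum((g - alpha) ** 2 for g in profile) / (n - 1)
    return alpha, math.sqrt(variance)


def ge_thresh(profile: Sequence[float], k: int) -> float:
    """k-sigma active threshold of Eq. (3) for one gene."""
    alpha, sigma = mean_and_sd(profile)
    # The factor is close to 0 for flat curves and close to 1 for
    # strongly fluctuating ones.
    damping = 1.0 - 1.0 / (1.0 + sigma ** 2)
    return alpha + k * sigma * damping
```

For a perfectly flat profile the SD is zero, so all three thresholds collapse onto the mean; for a strongly fluctuating profile the threshold approaches α(p) + k · σ(p).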

Ge_thresh_k is the active threshold of gene p, which is determined by the values of α(p), σ²(p) and k (the sigma multiple). If σ²(p) is very low, the expression curve of gene p fluctuates very little and the value of G_i(p) tends to be very close to α(p). In this case, the value of Ge_thresh_k is close to α(p). If σ²(p) is very high, the value of G_i(p) is spread out over a large range of values. A large σ²(p) generally indicates considerable noise in the gene expression data of gene p. In this case, the value of Ge_thresh_k is close to α(p) + k · σ(p). Note that k (the sigma multiple) lies in the range (0, 3], with 3 being the maximum multiple. The larger k is, the higher Ge_thresh_k becomes. If we choose a larger k, the active proteins that pass Ge_thresh_k are identified with higher confidence. For instance, based on the three-sigma rules, when G_i(p) > α(p) + 3 · σ(p), the probability that protein p (the product of gene p) is active at time point i is 99.7 %, but when G_i(p) > α(p) + σ(p), the probability that protein p is active at time point i is only 68.3 %. Based on Ge_thresh_k, we calculate the active probability of a protein at time point i as follows.

$$ {\Pr}_i(p) = \begin{cases} 0.99 & \text{if } G_i(p) \ge \mathrm{Ge\_thresh}_3(p) \\ 0.95 & \text{if } \mathrm{Ge\_thresh}_3(p) > G_i(p) \ge \mathrm{Ge\_thresh}_2(p) \\ 0.68 & \text{if } \mathrm{Ge\_thresh}_2(p) > G_i(p) \ge \mathrm{Ge\_thresh}_1(p) \\ 0 & \text{if } G_i(p) < \mathrm{Ge\_thresh}_1(p) \end{cases} $$

(4)

In equation (4), the active probability of a protein takes one of four levels based on the sigma rules (P{|X − α| < σ} ≈ 0.6827, P{|X − α| < 2σ} ≈ 0.9545 and P{|X − α| < 3σ} ≈ 0.9973). In particular, if the value of G_i(p) is lower than Ge_thresh_1(p), the active probability is 0, which indicates that protein p is not active at time point i. In general, the active probability of a protein represents its active level at a time point. Neither the global threshold method nor the active threshold method can effectively distinguish the active level of a protein at a time point based on gene expression data. Based on the active probability of a protein, we can not only identify the active time points of the protein, but also distinguish its active level at each of them.
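The four-level assignment of equation (4) can be sketched as follows. The function name `active_probabilities` and the list-based profile are illustrative assumptions; `statistics.stdev` is used because it computes the SD with n − 1 in the denominator, matching equation (2).

```python
import statistics
from typing import List, Sequence


def active_probabilities(profile: Sequence[float]) -> List[float]:
    """Active probability of one protein at every time point (Eq. 4)."""
    alpha = statistics.mean(profile)
    sigma = statistics.stdev(profile)  # sample SD, n - 1 denominator
    damping = 1.0 - 1.0 / (1.0 + sigma ** 2)
    # k-sigma thresholds of Eq. (3) for k = 1, 2, 3
    thresh = {k: alpha + k * sigma * damping for k in (1, 2, 3)}

    probs = []
    for g in profile:
        if g >= thresh[3]:
            probs.append(0.99)
        elif g >= thresh[2]:
            probs.append(0.95)
        elif g >= thresh[1]:
            probs.append(0.68)
        else:
            probs.append(0.0)
    return probs


# Hypothetical profile: one clear expression peak at the last time point.
profile = [0.0] * 19 + [10.0]
print(active_probabilities(profile)[-1])
```

One consequence of taking equations (3) and (4) literally is that a perfectly flat profile (σ = 0) makes all three thresholds equal to the mean, so every time point would receive 0.99; a practical implementation may want to treat that case separately.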

Construction of DPPN

Since the active periods of proteins differ, the real PPI networks in a living cell change over time. We can calculate the active probability of proteins at each time point based on gene expression data. In this section, we construct DPPN by integrating the active information of proteins into static PPI networks based on attributed graph theory.

We define a DPPN as a 7-tuple G = (V, E, A, P, F_v, F_e, F_p), where V is the set of protein vertices, E is the set of PPIs, A = {T1, T2, …, Tn} is the set of active time points for protein vertices, and P = {P1, P2, P3} is the set of active probabilities for protein vertices at each active time point. F_v is a function that returns the set of active time attributes of a protein vertex. Each protein vertex v_i in V has a set of active time attributes F_v(v_i) = {T_i1, T_i2, …, T_im}, where m = |F_v(v_i)| and F_v(v_i) ⊆ A. Likewise, F_p(v_i, T_ij) = P_k is a function that returns the active probability P_k of the protein vertex v_i at time point T_ij. In this study, the active probability set P includes three values: P1 = 0.99, P2 = 0.95 and P3 = 0.68. Each PPI e(v_i, v_j) in E also has a set of active time attributes F_e(e(v_i, v_j)) = F_v(v_i) ∩ F_v(v_j), with F_e(e(v_i, v_j)) ≠ ∅.

Figure 1 shows an example of DPPN construction. Figure 1a is a static PPI network based on high-throughput PPI data, which consists of eight proteins. Figure 1b shows part of the gene expression profile of protein v₁. From Fig. 1b, it can be seen that the gene expression values of protein v₁ at T1 and T5 are significantly higher than at T2, T3 and T4. According to equation (4), Ge_thresh₂ > G_T1(v₁) > Ge_thresh₁ at time point T1, and G_T5(v₁) > Ge_thresh₃ at time point T5. Therefore, the active probabilities of protein v₁ are P3 (0.68) and P1 (0.99) at time points T1 and T5, respectively. Figure 1c lists the active time attributes and active probabilities of all protein vertices in Fig. 1a. Each protein vertex has an active time attribute set. For instance, v₁ has two active time attributes (T1 and T5), and v₂ has three active time attributes (T1, T2 and T4). Each protein vertex also has an active probability for each of its active time attributes. Figure 1d shows the DPPN constructed from Fig. 1a and c. Each edge in the DPPN has an active time attribute set. For example, e₁ represents the PPI between v₁ and v₂. The active time attribute sets of v₁ and v₂ are {T1, T5} and {T1, T2, T4}, respectively (Fig. 1c). The active time attribute set of e₁ is {T1}, which is calculated as {T1, T5} ∩ {T1, T2, T4}. If the active time attribute set of an edge is empty, the edge does not appear in the DPPN.
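The edge rule of the example above (intersect the active time sets of the two endpoints and drop the edge if the intersection is empty) can be sketched as below. The function name `build_dppn_edges` and the dictionary layout are our assumptions. The active time sets of `v1` and `v2` are taken from the Fig. 1 example; vertex `v3` and its active time set are hypothetical and serve only to show an edge being dropped.

```python
from typing import Dict, FrozenSet, Iterable, Set, Tuple


def build_dppn_edges(
    edges: Iterable[Tuple[str, str]],
    active_times: Dict[str, Set[str]],
) -> Dict[FrozenSet[str], Set[str]]:
    """Map each retained PPI to its active time attribute set F_e."""
    dppn: Dict[FrozenSet[str], Set[str]] = {}
    for u, v in edges:
        common = active_times.get(u, set()) & active_times.get(v, set())
        if common:  # edges with an empty active time set do not appear
            dppn[frozenset((u, v))] = common
    return dppn


active_times = {
    "v1": {"T1", "T5"},        # from the Fig. 1 example
    "v2": {"T1", "T2", "T4"},  # from the Fig. 1 example
    "v3": {"T3"},              # hypothetical vertex, for illustration only
}
static_edges = [("v1", "v2"), ("v1", "v3")]
print(build_dppn_edges(static_edges, active_times))
```

Only the edge between `v1` and `v2` survives, with active time attribute set {T1}.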

Protein complex identification from DPPN

Compared to static PPI networks, DPPN can effectively represent not only the topological structure but also the dynamic information of PPI networks. Since protein complexes are groups of proteins that interact with each other at the same time [2, 3], they generally correspond to dense subgraphs associated with the same active time attributes in DPPN. The edges in DPPN contribute differently to the protein complex identification task. Given a DPPN G, the topology score of edge e(v_i, v_j) is defined as follows:

$$ \mathrm{Topology\_score}\left(e(v_i, v_j)\right) = \frac{\left|N_i \cap N_j\right| + 1}{\max\left\{\mathrm{Avg.}(G), \left|N_i\right|\right\} + \max\left\{\mathrm{Avg.}(G), \left|N_j\right|\right\}} $$

(5)

$$ \mathrm{Avg.}(G) = \frac{\sum_{v_k \in V} \left|N_k\right|}{\left|V\right|} $$

(6)

where N_i and N_j denote the neighbor sets of v_i and v_j, respectively, |N_i ∩ N_j| denotes the number of common neighbors of v_i and v_j, and Avg.(G) is the average degree of the DPPN G. The more common neighbors v_i and v_j share, the larger the topology score. The term max{Avg.(G), |N_i|} effectively penalizes a protein v_i with very few neighbors [10]. Based on the topology score, the weight of edge e(v_i, v_j) at active time point k is given as:

$$ \mathrm{Weight}\left(e_k(v_i, v_j)\right) = \mathrm{Topology\_score}\left(e(v_i, v_j)\right) \cdot P_k(v_i) \cdot P_k(v_j) $$

(7)

where P_k(v_i) and P_k(v_j) are the active probabilities of v_i and v_j at time point k, respectively. Equation (7) takes into account not only the topological structure but also the dynamic information of DPPN. Since the active probabilities of v_i and v_j are likely to differ between active time points, the weight of edge e(v_i, v_j) changes dynamically across its active time points.
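Equations (5) to (7) can be sketched as follows. The adjacency-dictionary representation, the function names and the small example graph with its probabilities are illustrative assumptions, not data from the paper.

```python
from typing import Dict, Set

Graph = Dict[str, Set[str]]


def average_degree(adj: Graph) -> float:
    """Avg.(G) of Eq. (6)."""
    return sum(len(nbrs) for nbrs in adj.values()) / len(adj)


def topology_score(adj: Graph, u: str, v: str) -> float:
    """Topology score of edge e(u, v), Eq. (5)."""
    avg = average_degree(adj)
    common = len(adj[u] & adj[v])
    return (common + 1) / (max(avg, len(adj[u])) + max(avg, len(adj[v])))


def edge_weights(adj: Graph, u: str, v: str,
                 active: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Weight of e(u, v) at each common active time point, Eq. (7)."""
    score = topology_score(adj, u, v)
    common_times = active[u].keys() & active[v].keys()
    return {t: score * active[u][t] * active[v][t] for t in common_times}


# Hypothetical toy graph: a triangle a-b-c with a pendant vertex d on c.
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
# Hypothetical active probabilities per time point.
active = {"a": {"T1": 0.68, "T5": 0.99}, "b": {"T1": 0.95, "T5": 0.68}}
print(edge_weights(adj, "a", "b", active))
```

In this toy graph the edge inside the triangle scores higher than the pendant edge c-d, and the weight of a-b differs between T1 and T5 because the active probabilities differ.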

Definition 1 - Active correlated clique. Given a protein vertex set C and an edge set E_c in DPPN G, an active correlated clique is a pair ((C, E_c), A_c) such that each protein vertex v_i in C has degree |C| − 1 within (C, E_c). A_c is the set of active time attributes common to all protein vertices in C, and A_c ≠ ∅.

In general, many active correlated cliques can be mined from a DPPN. Figure 2 shows two active correlated cliques of the DPPN in Fig. 1.
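Definition 1 amounts to two checks: the vertices are pairwise connected, and they share at least one active time point. A minimal sketch, with hypothetical function names and toy data:

```python
from itertools import combinations
from typing import Dict, Iterable, Set


def common_active_times(vertices: Iterable[str],
                        active_times: Dict[str, Set[str]]) -> Set[str]:
    """A_c: active time attributes shared by every vertex in the set."""
    sets = [set(active_times[v]) for v in vertices]
    return set.intersection(*sets) if sets else set()


def is_active_correlated_clique(vertices: Set[str],
                                adj: Dict[str, Set[str]],
                                active_times: Dict[str, Set[str]]) -> bool:
    """True if the vertices form a clique with a non-empty common A_c."""
    fully_connected = all(v in adj[u] for u, v in combinations(vertices, 2))
    return fully_connected and bool(common_active_times(vertices, active_times))


# Hypothetical toy data: a triangle whose members are all active at T1.
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
times = {"a": {"T1", "T5"}, "b": {"T1", "T2"}, "c": {"T1"}, "d": {"T2"}}
print(is_active_correlated_clique({"a", "b", "c"}, adj, times))
```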

Definition 2 – Active clique score. Given an active correlated clique ((C, E_c), A_c), the active clique score of ((C, E_c), A_c) at the active time point k (k ∈ A_c) is given as:

$$ \mathrm{Clique\_score}\left((C, E_c), A_c\right) = \mathrm{Clique\_Pr.}\left((C, E_c), A_c\right) \cdot \sum_{e_{ij} \in E_c} \mathrm{Topology\_score}(e_{ij}) $$

(8)

$$ \mathrm{Clique\_Pr.}\left((C, E_c), A_c\right) = \max\left\{ \prod_{v_i \in C} P_k(v_i),\ k \in A_c \right\} $$

(9)

where P_k(v_i) is the active probability of v_i at time point k. The product ∏_{v_i ∈ C} P_k(v_i) is the active probability of the clique ((C, E_c), A_c) at time point k. Clique_Pr.((C, E_c), A_c) chooses the maximum of ∏_{v_i ∈ C} P_k(v_i) over all common active time points as the active probability of the clique. Therefore, the active probability of an active correlated clique is associated with a unique active time point. We use ((C, E_c), T_c) to denote an active correlated clique that attains its clique probability at active time point T_c. The clique score provides a reasonable combination of the topological connectivity and the dynamic active attributes of DPPN. If an active correlated clique has a large clique score, the proteins of the clique both lie in a dense subgraph of the DPPN and are highly active at the same time point. Therefore, the clique score can effectively evaluate how likely an active correlated clique is to be the core structure of a protein complex.
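Equations (8) and (9) can be sketched as follows. The function names, the data layout and the example probabilities are illustrative assumptions. When two time points give the same product, this sketch keeps the first one in sorted order; that tie-breaking rule is our choice and is not specified in the text.

```python
from itertools import combinations
from math import prod
from typing import Dict, Set, Tuple

Graph = Dict[str, Set[str]]


def topology_score(adj: Graph, u: str, v: str) -> float:
    """Topology score of edge e(u, v), Eq. (5)."""
    avg = sum(len(n) for n in adj.values()) / len(adj)
    common = len(adj[u] & adj[v])
    return (common + 1) / (max(avg, len(adj[u])) + max(avg, len(adj[v])))


def clique_score(clique: Set[str], adj: Graph,
                 active: Dict[str, Dict[str, float]]) -> Tuple[float, str]:
    """Return (Clique_score, T_c) for an active correlated clique."""
    members = sorted(clique)
    common = set.intersection(*(set(active[v]) for v in members))
    if not common:
        raise ValueError("vertices share no active time point")

    def clique_pr(t: str) -> float:
        # Product of the members' active probabilities at time t (Eq. 9)
        return prod(active[v][t] for v in members)

    best_time = max(sorted(common), key=clique_pr)
    topo_sum = sum(topology_score(adj, u, v)
                   for u, v in combinations(members, 2))
    return clique_pr(best_time) * topo_sum, best_time


# Hypothetical triangle with hypothetical active probabilities.
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}}
active = {
    "a": {"T1": 0.99, "T2": 0.68},
    "b": {"T1": 0.95, "T2": 0.99},
    "c": {"T1": 0.99, "T2": 0.95},
}
print(clique_score({"a", "b", "c"}, adj, active))
```

Here the product of probabilities is larger at T1 than at T2, so the clique is associated with T1.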

Gavin et al. [11] revealed the core-attachment structure of protein complexes through a genome-wide analysis of yeast complexes. Based on the core-attachment structure assumption, our method for protein complex identification from DPPN involves two phases. In the first phase, we identify the core structures of protein complexes from DPPN. In the second phase, we augment each protein complex from its core structure by adding close neighbor proteins.

In the first phase, we used the clique mining algorithm [27] to enumerate all maximal cliques that contain three or more proteins in the DPPN, and calculated the common active time attribute set for each maximal clique. If the common active time attribute set was not empty, the maximal clique was an active correlated clique. The candidate core set Candidate_CORE comprised all active correlated cliques, which generally overlapped. We used equation (8) to calculate the active clique score for all active correlated cliques in Candidate_CORE, and ranked them in descending order of active clique score, denoted as {((C, E_c1), T_c1), ((C, E_c2), T_c2), …, ((C, E_cn), T_cn)}. The top-ranked clique ((C, E_c1), T_c1) was then deleted from Candidate_CORE and inserted into the core set CORE. To ensure that the active correlated cliques in CORE were non-overlapping, we used the same method as [10] to remove or prune overlapping cliques until the candidate core set Candidate_CORE was empty. In this way, we could generate core structures for most protein complexes. However, some protein complexes have low density or contain only two proteins [28, 29]. To solve this problem, we added some edges with high weight to the core set CORE. We used equation (7) to calculate the weight of the edges that were not contained in any active correlated clique. If the weight of an edge was larger than the predefined threshold core_thresh, we directly added the edge to the core set CORE. Therefore, we chose not only active correlated cliques but also edges with high weight as core structures of protein complexes.
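The candidate-generation step of the first phase can be sketched as below. Note the assumptions: maximal cliques are enumerated here with the basic Bron-Kerbosch recursion, which is not necessarily the algorithm of [27]; the ranking by clique score and the overlap removal of [10] are left out; and all names and toy data are hypothetical.

```python
from typing import Dict, FrozenSet, Iterator, List, Set, Tuple

Graph = Dict[str, Set[str]]


def maximal_cliques(adj: Graph) -> Iterator[Set[str]]:
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion."""
    def expand(r: Set[str], p: Set[str], x: Set[str]) -> Iterator[Set[str]]:
        if not p and not x:
            yield r  # r cannot be extended, so it is maximal
            return
        for v in list(p):
            yield from expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    yield from expand(set(), set(adj), set())


def candidate_cores(adj: Graph, active_times: Dict[str, Set[str]]
                    ) -> List[Tuple[FrozenSet[str], Set[str]]]:
    """Maximal cliques of size >= 3 with a non-empty common active time set."""
    cores = []
    for clique in maximal_cliques(adj):
        if len(clique) < 3:
            continue
        common = set.intersection(*(active_times[v] for v in clique))
        if common:
            cores.append((frozenset(clique), common))
    return cores


# Hypothetical toy graph: a triangle a-b-c with a pendant vertex d on c.
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
times = {"a": {"T1", "T5"}, "b": {"T1", "T2"}, "c": {"T1"}, "d": {"T1"}}
print(candidate_cores(adj, times))
```

The pair c-d is a maximal clique but is skipped because it has fewer than three proteins; in the method described above, such pairs can still enter CORE through the core_thresh rule on edge weights.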

In the second phase, we augmented the core structure by adding close neighbor proteins one by one. We used the attached score to measure how closely a protein v_k with active time attribute set A_k was connected to a core structure ((C, E_c), T_c), where v_k ∉ C and T_c ∈ A_k. The attached score of v_k with respect to ((C, E_c), T_c) is given as:

$$ \mathrm{Attach\_score}\left((v_k, A_k), ((C, E_c), T_c)\right) = \frac{\sum_{v_i \in C} \mathrm{Weight}\left(e_{T_c}(v_i, v_k)\right)}{\left|C\right|} $$

(10)

If the Attach_score was larger than extend_thresh, then v _k was added to the core structure ((C, E _c), T _c). Therefore the final identified protein complexes were generated by adding the close neighbor proteins to the core structure. Here, extend_thresh was a predefined threshold. The optimal value of extend_thresh and core_thresh can usually be determined in preliminary experiments.
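The second phase can be sketched as follows. Two points are our assumptions because the text leaves them open: a core member that is not adjacent to v_k contributes a weight of 0 to the sum in equation (10), and every neighbor is scored against the original core rather than against the growing complex. The names, the toy graph, the probabilities and the threshold values are all hypothetical.

```python
from typing import Dict, Set

Graph = Dict[str, Set[str]]


def topology_score(adj: Graph, u: str, v: str) -> float:
    """Topology score of edge e(u, v), Eq. (5)."""
    avg = sum(len(n) for n in adj.values()) / len(adj)
    common = len(adj[u] & adj[v])
    return (common + 1) / (max(avg, len(adj[u])) + max(avg, len(adj[v])))


def attach_score(v_k: str, core: Set[str], t_c: str, adj: Graph,
                 active: Dict[str, Dict[str, float]]) -> float:
    """Attach_score of v_k with respect to the core at time T_c (Eq. 10)."""
    total = 0.0
    for v_i in core:
        if v_k in adj[v_i]:  # assumption: missing edges contribute 0
            total += (topology_score(adj, v_i, v_k)
                      * active[v_i][t_c] * active[v_k][t_c])
    return total / len(core)


def extend_core(core: Set[str], t_c: str, adj: Graph,
                active: Dict[str, Dict[str, float]],
                extend_thresh: float) -> Set[str]:
    """Add every neighbor active at T_c whose score exceeds extend_thresh."""
    complex_ = set(core)
    neighbors = set().union(*(adj[v] for v in core)) - set(core)
    for v_k in sorted(neighbors):
        if t_c in active[v_k] and \
                attach_score(v_k, core, t_c, adj, active) > extend_thresh:
            complex_.add(v_k)
    return complex_


# Hypothetical toy data: core triangle a-b-c, neighbor d linked to a and b.
adj = {"a": {"b", "c", "d"}, "b": {"a", "c", "d"},
       "c": {"a", "b"}, "d": {"a", "b"}}
active = {v: {"T1": 0.99} for v in adj}
print(extend_core({"a", "b", "c"}, "T1", adj, active, extend_thresh=0.2))
```

With the hypothetical threshold 0.2 the neighbor d is attached; raising the threshold to 0.3 leaves the core unchanged.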

Results and discussion

In this section, the datasets and evaluation metrics used in the experiments are described. The impact of the core_thresh and extend_thresh parameters is assessed. Finally, our method is compared with current state-of-the-art protein complex identification methods.

Datasets and evaluation metrics

The three high-throughput PPI datasets used in our experiments were the Krogan dataset [30], the DIP dataset [31] and the MIPS dataset [32]. The statistics of the three yeast PPI datasets are listed in Table 1. The benchmark protein complex datasets are CYC2008 [28] and MIPS2006 [33], which consist of 408 and 217 protein complexes, respectively.

Table 1 The statistics of high-throughput PPI datasets in experiments
