Detecting temporal protein complexes from dynamic protein-protein interaction networks

Background Proteins dynamically interact with each other to perform their biological functions. The dynamic operations of protein interaction networks (PPI) are also reflected in the dynamic formations of protein complexes. Existing protein complex detection algorithms usually overlook the inherent temporal nature of protein interactions within PPI networks. Systematically analyzing the temporal protein complexes can not only improve the accuracy of protein complex detection, but also strengthen our biological knowledge on the dynamic protein assembly processes for cellular organization. Results In this study, we propose a novel computational method to predict temporal protein complexes. Particularly, we first construct a series of dynamic PPI networks by joint analysis of time-course gene expression data and protein interaction data. Then a Time Smooth Overlapping Complex Detection model (TS-OCD) has been proposed to detect temporal protein complexes from these dynamic PPI networks. TS-OCD can naturally capture the smoothness of networks between consecutive time points and detect overlapping protein complexes at each time point. Finally, a nonnegative matrix factorization based algorithm is introduced to merge those very similar temporal complexes across different time points. Conclusions Extensive experimental results demonstrate the proposed method is very effective in detecting temporal protein complexes than the state-of-the-art complex detection techniques. Electronic supplementary material The online version of this article (doi:10.1186/1471-2105-15-335) contains supplementary material, which is available to authorized users.


Supplementary Table
Here "# complexes" denotes the number of detected complexes, "avg size" and "std" denote the average size and standard deviation of the detected complexes. Here "# complexes" denotes the number of detected complexes, "avg size" and "std" denote the average size and standard deviation of the detected complexes.     Figure S7: Interaction map of DNA-directed RNA polymerase I, II, III complexes detected by MINE on BioGrid. Proteins are labeled according to the complexes they belong to: hexagon nodes represent RNA polymerase I, circle nodes represent RNA polymerase II, rectangle nodes represent RNA polymerase III, diamond nodes represent proteins shared by all the three complexes and parallelogram nodes represent proteins with other functions. Shaded areas represent the clusters detected by MINE.

Model parameter estimation for TS-OCD
The objective function of TS-OCD is as follows: where λ ≥ 0 and β ≥ 0 are the tradeoff parameters which control the balance between loss function and the regularization terms.
We utilize the multiplicative updating rule [6] to solve this nonnegative constrained optimization problem. Let ik ] be the Lagrange multipliers for constraint H (t) ≥ 0, t = 1, . . . , T . Therefore, the Lagrange function L is as follows: Taking the gradients of Lagrange function L with respect to H (t) ik , we could obtain: and for t = 2, . . . , T − 1, we have: and for t = T , we have: Since the estimators of H and for t = 2, . . . , T − 1, we have: and for t = T , we have: By the Karush-Kuhn-Tucker (KKT) conditions [5], ϕ ik = 0, so we could obtain the following equations for H (t) ik : and for t = 2, . . . , T − 1: and for t = T : Through this rule, we obtain the following updating rule for H (t) : for t = 2, . . . , T − 1, for t = T , Once each H (t) is initialized, we update H (t) according to Equations (12), (13) and (14) alternately until a stopping criterion is satisfied. Since the objective function in Equation (1) is non-convex, the final estimators of each H (t) depends on the initial values. To reduce the risk of local minimization, we repeat the entire updating procedure 10 times with random initialization and choose the result that gives the lowest value of the objective function as the final estimator. In our implementation, the iteration process stops whenever ||H To avoid the case that this process converges too slowly, we also stop it if the number of iterations reaches 200. The procedure of identifying temporal protein complexes via our algorithm is described in Fig. S4.

Convergence analysis
We solve the optimization problem of TS-OCD via multiplicative updating rules which are special cases of gradient descent with an automatic step parameter selection. It could be proved that the objective function of our model is nonincreasing during each updating process and the iterative algorithm is guaranteed to find a least locally optimal solutions. Instead of proving this in theory, we validate the convergence experimentally. For each data set, we detect how the value of objective function changes with respect to the times of iterations. Fig. S8 shows the corresponding results on DIP and BioGrid with respect to the objective function of TS-OCD. From Fig. S8, we can find that the objective function of TS-OCD decrease sharply at the beginning and then change smoothly with respect to each update. When iterating the updating process for more than 200 times, the change of the objective function is small and can be neglected. Therefore, considering the problem of efficiency, we set the maximum iteration time to be 200.

Data sets
We concentrate our study on yeast since it is a well studied model organism. The interactions derived from DIP [12] and BioGrid (version 3.1.77) [1] are used to test the performance respectively. We refer to them as DIP and BioGrid data sets. We download the BioGRID networks from the website of Nepusz et al.'s study (http://membrane.cs.rhul.ac.uk/ static/cl1/cl1_datasets.zip) [9]. To construct dynamic PPI networks, we integrate time-course gene expression data with physical PPI networks. The gene expression data are download from Gene Expression Omnibus (GEO) [2] with the accession number GSE3431 and we only use the 3552 significantly periodic genes [15]. Among the 3552 genes, 2389 occur in DIP and 3057 occur in BioGrid. Thus, we retain these genes and the corresponding interactions among them in DIP and BioGrid respectively. Table S4 lists several topological features of the two networks and shows that they have different structural characterizations. The topological differences between them can be used to test the generalization of each considered approach. These statistics are calculated using software Cytoscape [13].
We use the CYC2008 [10] and MIPS [8] benchmarks as the gold standards of yeast protein complexes. The CYC2008 catalogue is downloaded from http://wodaklab.org/cyc2008/downloads on April 6, 2013. For the MIPS gold standard, we use the dataset which has been used in [9] and can be download from http://membrane.cs.rhul. ac.uk/static/cl1/cl1$_$gold_standard.zip. For details of the construction of this benchmark, please refer to [9]. We map both the two reference sets onto each PPI network and filter them based on size in a similar manner of [9] (http://membrane.cs.rhul.ac.uk/static/cl1/additional_information.html). The two gold standards are used independently for evaluation of the methods. The general properties of the reference sets are listed in Table S5. Here "All" denotes the statistics of each reference set which is not mapped onto the PPI network and filtered in terms of size.

Evaluation metrics
To evaluate the performance for complex detection, two independent quality criteria-PR metric [14] and f -measure [7], are used to assess the similarity between the predicted complexes and the known complexes. These two metrics have complementary strengths, so they could evaluate the performance from different perspectives. Between these two measures, PR metric could judge how well the predicted complexes correspond to known complexes by considering the number of proteins in each complex as well as the overlaps between predicted complex and know complexes. While f -measure assess the performance from a macro perspective (Recall measures what fraction of the known complexes are matched by the predicted complexes, and Precision measures what fraction of the predicted complexes are matched with known complexes). We first give some notations before describing these measures. Let P denote the number of complexes detected by a particular algorithm and T denote the number of reference complexes. Let C i represents the set of proteins belong to the i-th detected complex and G j represents the set of proteins belong to the j-th reference complex.
We say a detected complex C i and a reference complex G j match each other if: where ν is an input parameter between 0 and 1 which is usually set to 0.25 [7]. Therefore, in this study, we fix ν = 0.25. Given a set of predicted complexes C = {C 1 , C 2 , · · · , C P } and a set of reference complexes Precision and Recall are defined as follows: In order to take into account of both the Precision and Recall, an integrated method called f -measure is used.
The other measure is defined as follows: PR measure: The precision-recall (PR)-based score P R i,j between a predicted complex C i and a reference complex G j is calculated by is the precision metric which measures what fraction of the proteins in predicted complex C i correspond to reference complex G j , and the second part is the recall metric which measures how much of reference complex G j is recovered by predicted complex C i . For each predicted complex C i , we find the reference complex that maximizes the PR score between them, which is defined as P RC i = max j P R i,j and for each reference complex G j , we try to find the predicted complex that maximizes the PR score between them, that is P RG j = max i P R i,j for the PR measure.
Taking average over all the predicted complexes, weighted by the size of each predicted complex, we obtain P RC as follows: Similarly, the measures P RG for the T reference complexes is P RG = . Finally, we use P R which is the harmonic mean of P RC and P RG to quantify the accuracy of the predicted complexes: . We implement the Matlab code for the calculation of the PR score according to the formulations described in [14].

Effect of random restart
Since the objective function of TS-OCD is not convex, we can not guarantee the multiplicative updating rule-based iterative algorithm will converge to the global minimum. To avoid local minimization, we repeat the entire calculation 10 times with random restarts and choose the result that gives the lowest value of the objective function. We limit the number of repetitions to be 10 because of the time cost of each repetitions. As a result, we can not guarantee the final estimator is the globally optimum solution and the result is not deterministic. We therefore focus on the variability of the results with random restarts. We repeat the entire procedure 10 times with random restarts and see how the results are affected by different restarts. For DIP and BioGrid, the corresponding results are shown in Fig. S9 and S10 respectively. From Fig. S9 and S10, we can find that, with different random restarts, the performance of TS-OCD change obviously. However, within ten random restarts, we could obtain reasonable good results. Therefore, in this study, we repeat the entire calculation 10 times with random restarts and choose the result that gives the lowest value of the objective function. Note better results will be obtained if more repetitions are conducted.

Effect of smooth regularization
To investigate the benefits of using the smooth regularization, we compare the performance of our model with and without smooth regularization (denoted as NS-OCD, Non-Smooth Overlapping Complex Detection). We apply TS-OCD and NS-OCD on DIP and BioGrid dynamic networks respectively and evaluate their performance in terms of two metrics (PR and f -measure) based on two gold standards (CYC2008 and MIPS). For NS-OCD, we also fix the value of β to be 2 4 . Fig. S1 and Fig. S2 shows the comparative performance of TS-OCD and NS-OCD on DIP and BioGrid dynamic networks using the benchmark CYC2008 and MIPS.
From Fig. S1 and Fig. S2, we can find that TS-OCD performs better than NS-OCD on both DIP and BioGrid data. For instance, on BioGrid data, the f-measure for TS-OCD and NS-OCD are 0.487 and 0.436 respectively with respect to CYC2008. That is, the complexes detected by TS-OCD have better quality than those detected by NS-OCD. In most cases, the living system is more likely to change gradually other than dramatically. Therefore, with time smooth regularization, the TS-OCD model may help to better capture the temporal behaviors of protein complexes.

Parameter settings of compared algorithms
In this paper, in order to evaluate the performance of our method in detecting protein complexes, we compare it with five existing methods: ClusterONE [9], COACH [16], MCL [3], MINE [11], and SPICi [4]. Table S6 lists the websites where we download the softwares of these algorithms and the version numbers of these softwares. Before describing the parameter settings for each algorithm, we declare several general consideration first. Since the performance of each algorithm depends on the choice of its inherent parameters and the data set under consideration, for all the considered algorithms, we optimize the parameters that yield the best results. To avoid evaluation bias, we also consider the following three criterions: • Two quality metrics (PR score and f -measure) are used to evaluate the performance of each algorithm.
• Two different gold standards (the MIPS complexes and the CYC2008 complexes) are used.
• For each algorithm, the final results are obtained by choosing the parameters that yield the best performance which are measured by the f -measure on the MIPS complexes.
We briefly review the main features of these algorithms and the setting of parameters for each algorithm in the following text.

ClusterONE
ClusterONE is recently proposed by Nepusz et al. [9] to detect overlapping protein complexes in PPI networks based on overlapping neighborhood expansion. As suggested by the authors, we do not tune the parameters for a particular network. Thus, we use the default settings of parameters in the software.
COACH COACH, as a core-attachment based method, has the following two steps to detect protein complexes from PPI networks. First, it detects local dense clusters as cores. Second, cores will be expanded to complexes by including attachment proteins that are closely connected to cores. There is a parameter ω in the first step to control the overlap between identified cores, e.g., a higher ω allow more common proteins between two different cores. In this study, we try different values of ω, ranges from 0 to 0.2 with 0.05 increment. The optimal value of ω for each PPI network is shown in Table S7.

MCL
Markov Clustering Algorithm (MCL) [3] is a competing protein complex detection algorithm and has been developed in different languages, such as JAVA, R and C. The key parameter of MCL is inflation, which tunes the granularity of clustering. Here, we try different values of inflation, ranges from 1.2 to 5.0 with 0.2 increment. The optimal value of inflation for each PPI network is shown in Table S8.    [11] can identify highly modular sets of proteins within highly interconnected PPI networks. The key parameters of MINE are node score cutoff and modularity score cutoff. We try different value of node score cutoff and modularity score cutoff (from 0.1 to 1 with 0.1 as the step size) and 3 settings of depth limit (3,4,5). For the other parameters, without stating, we use the default values in the software. The optimal values of the parameters of MINE for each PPI network are listed in Table S9.

SPICi
SPICi [4] is a computationally efficient local network clustering algorithm for large biological networks, which can be applied on PPI networks for complex detection. SPICi has two parameters: the density threshold and the support threshold. Here, we try different values of density threshold, ranges from 0.1 to 1 with 0.1 increment. For the other parameters, we use the default settings in the software. Table S10 lists the optimal value of density parameter for each PPI networks.