A two-layer integration framework for protein complex detection

Background
Protein complexes carry out nearly all signaling and functional processes within cells. Studying protein complexes is an effective strategy for analyzing cellular functions and biological processes. With the increasing availability of proteomics data, various computational methods have recently been developed to predict protein complexes. However, different computational methods rest on their own assumptions and are designed to work on different data sources, while the various biological screening methods have their own experimental conditions and often differ in scale and noise level. Therefore, a single computational method applied to a specific data source is generally unable to generate comprehensive and reliable predictions.

Results
In this paper, we develop a novel Two-layer INtegrative Complex Detection (TINCD) model to detect protein complexes, leveraging the information from both clustering results and raw data sources. In particular, we first integrate various clustering results to construct consensus matrices for proteins that measure their overall co-complex propensity. Second, we combine these consensus matrices with the co-complex score matrix derived from Tandem Affinity Purification/Mass Spectrometry (TAP) data and obtain an integrated co-complex similarity network via an unsupervised metric fusion method. Finally, a novel graph regularized doubly stochastic matrix decomposition model is proposed to detect overlapping protein complexes from the integrated similarity network.

Conclusions
Extensive experimental results demonstrate that TINCD performs much better than 21 state-of-the-art complex detection techniques, including ensemble clustering and data integration techniques.

Electronic supplementary material
The online version of this article (doi:10.1186/s12859-016-0939-3) contains supplementary material, which is available to authorized users.

where λ ≥ 0 is the tradeoff parameter that controls the balance between the two factors.
To implement the optimization, we employ a relaxed Majorization-Minimization algorithm [17]. Let Φ = [ϕ_ik] be the Lagrange multipliers for the constraint θ ≥ 0 and η_i be the Lagrange multipliers for the constraint ∑_{k=1}^{K} θ_ik = 1. These multipliers define the Lagrange function L. Let ∇ = ∇⁺ − ∇⁻ denote the gradient of J with respect to θ, where ∇⁺ and ∇⁻ are the positive and negative parts, respectively. This suggests a fixed-point update rule for θ; imposing ∑_{k=1}^{K} θ_ik = 1 then yields the update rule for θ shown in Algorithm 1. Once θ is initialized, we update θ according to Algorithm 1 until a stopping criterion is satisfied. In this study, we stop iterating when the relative change of the objective function falls below 1e-6 or the number of iterations reaches the maximum (limited here to 200). Since the objective function in Equation (1) is non-convex, the final estimate of θ depends on the initial values. To reduce the risk of ending in a poor local minimum, we repeat the entire updating procedure 20 times with random initialization and choose the result that gives the lowest value of the objective function (1) as the final estimator of θ, which is denoted as θ̂.

Figure S4: The COMPASS complex as detected by different computational methods (panel (c): TINCD). The shaded area shows the complex predicted by each method; red circle nodes represent subunits of the COMPASS complex in CYC2008.
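The stopping criterion and random-restart strategy described above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: `objective`, `update` and `init` are hypothetical callables standing in for Equation (1), the update rule of Algorithm 1 and the random initialization of θ, and the toy problem at the end exists only so that the sketch runs.

```python
import numpy as np

def fit_with_restarts(objective, update, init, n_restarts=20,
                      max_iter=200, tol=1e-6, seed=0):
    """Run a fixed-point update from several random starts and keep the best.

    objective, update and init are hypothetical placeholders for the
    objective function, the update rule for theta and its initialization.
    """
    rng = np.random.default_rng(seed)
    best_theta, best_value = None, np.inf
    for _restart in range(n_restarts):
        theta = init(rng)
        value = objective(theta)
        for _it in range(max_iter):
            theta = update(theta)
            new_value = objective(theta)
            # Relative change of the objective between consecutive iterations
            # (the small floor guards against division by zero).
            rel_change = abs(new_value - value) / max(abs(value), 1e-12)
            value = new_value
            if rel_change < tol:
                break
        # Keep the run with the lowest final objective value.
        if value < best_value:
            best_theta, best_value = theta, value
    return best_theta, best_value

# Toy problem (an assumption for illustration only): pull a row-stochastic
# matrix toward the uniform matrix while keeping every row sum equal to 1.
target = np.full((4, 3), 1.0 / 3.0)

def random_row_stochastic(rng, n=4, k=3):
    theta = rng.random((n, k)) + 1e-3
    return theta / theta.sum(axis=1, keepdims=True)

def toy_objective(theta):
    return float(np.sum((theta - target) ** 2))

def toy_update(theta):
    theta = 0.5 * (theta + target)
    return theta / theta.sum(axis=1, keepdims=True)

theta_hat, value = fit_with_restarts(toy_objective, toy_update,
                                     random_row_stochastic)
```

Passing the objective and the update as arguments keeps the restart logic independent of the particular decomposition model, so the same loop could wrap any fixed-point rule that preserves the row-sum constraint.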
Among these algorithms, the performance of CFinder is determined by the size of the k-clique. CMC has two key parameters, the overlap threshold and the merging threshold. COACH has one key parameter, ω. DPClus uses two parameters, D_in and CP_in (D_in is the minimum density and CP_in is the minimum value of the cluster property), to determine whether a neighbor should be added to a cluster. IPCA has two key parameters, T_in and d. MCL has one tuning parameter, the inflation. MCODE has one key parameter, the node score cutoff. The performance of RRW is determined by the minimum cluster size. SPICi has one key parameter, the density threshold. EC-BNMF is an ensemble clustering algorithm which has two key parameters.

In this study, we tune the parameters of CFinder, CMC, COACH, DPClus, IPCA, MCL, MCODE, RRW, SPICi and EC-BNMF to obtain their best results, while ClusterONE and RNSC use the default parameters set by their authors. For CFinder, k ranges from 3 to 10 with a step size of 1, and the best performance is obtained at k = 3. For CMC, the overlap threshold ranges from 0.2 to 0.8 with a step size of 0.1 and the merging threshold ranges from 0 to 1 with a step size of 0.1; the best performance is achieved when both thresholds are set to 0.5. For COACH, ω is set to 0.225. For DPClus, we try different values of D_in and CP_in (from 0.3 to 0.8 with a step size of 0.1), and the best performance is obtained when D_in = 0.6 and CP_in = 0.5. For IPCA, d is set to 2, while T_in ranges from 0.1 to 0.9 with a step size of 0.1, and the best performance is achieved when T_in = 0.5. For MCL, the inflation is chosen from 1.2 to 4.9 with an interval of 0.1, and the best performance is obtained when the inflation is set to 1.9. For MCODE, the node score cutoff is searched from 0.1 to 1 with an interval of 0.1, and the best performance is obtained when the cutoff is set to 0.2. For RRW, the minimum cluster size is set to 5. For SPICi, we try density thresholds from 0.1 to 1 with an increment of 0.1, and the best performance is achieved when the density threshold is set to 0.5. As an ensemble clustering algorithm, EC-BNMF takes as input a series of base clustering results, which can be derived from different clustering algorithms. Therefore, in this study, we use the clustering results of the above 16 approaches as the input data for EC-BNMF. The clustering results of EC-BNMF are obtained with the best tuned parameters (K = 2000, a = 2, b = 180). The source code for all these algorithms is obtained from the web pages provided in the corresponding papers.

Algorithm 1: Pseudocode for detecting protein complexes using the graph regularized doubly stochastic matrix decomposition model. Output: Q, the set of predicted protein complexes.
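The parameter sweeps described above amount to a one-dimensional grid search per algorithm. The following Python sketch shows one way to generate such grids and pick the best value. The helper names `grid` and `tune` are hypothetical, and `toy_score` is a placeholder that merely peaks at the inflation value reported in the text; in practice the score would come from running the clustering tool and evaluating its output against a gold standard.

```python
def grid(start, stop, step):
    """Inclusive parameter grid, rounded to avoid floating-point drift."""
    n = int(round((stop - start) / step)) + 1
    return [round(start + i * step, 10) for i in range(n)]

def tune(param_grid, score):
    """Return the parameter value with the highest score (first wins ties)."""
    return max(param_grid, key=score)

# Grids taken from the ranges reported in the text.
cfinder_k_grid = grid(3, 10, 1)          # CFinder clique size k
mcl_inflation_grid = grid(1.2, 4.9, 0.1)  # MCL inflation

# Placeholder score (an assumption for illustration only): a real score
# would run MCL with the given inflation and evaluate the predicted complexes.
def toy_score(inflation):
    return -abs(inflation - 1.9)

best_inflation = tune(mcl_inflation_grid, toy_score)
```

Rounding each grid value keeps entries such as 1.9 exact instead of 1.9000000000000001, which matters when tuned values are later compared or reported.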

Comparative results with respect to f-measure
When evaluating a set of predicted clusters against a reference set, other commonly used evaluation metrics include Sensitivity, Specificity and f-measure. Given two complexes x_i and y_j, we consider them to be matching if |x_i ∩ y_j|² / (|x_i| |y_j|) ≥ ω, where ω is set to 0.2 in our study. Let TP (true positive) be the number of predicted complexes matched by the known complexes, FN (false negative) be the number of known complexes that are not matched by the predicted complexes, and FP (false positive) be the number of predicted complexes minus TP. Sensitivity, Specificity and f-measure are then defined in terms of TP, FN and FP. We report the Sensitivity, Specificity and f-measure of each method in Table S3. As shown in Table S1, TINCD predicts 1,562 complexes covering 5,846 proteins, which is very close to the size of the input data with 5,929 proteins. However, with respect to all three gold standard sets, our method obtains low Specificity and high Sensitivity. We note that the data set used in our study contains 5,929 proteins, while the three gold standard sets (i.e., CYC2008, MIPS and SGD) cover 1,324, 1,171 and 1,154 proteins, respectively. That is, the gold standard sets are far from complete, so most of our predicted complexes cannot match any benchmark complex. According to the definition of Specificity, these predicted complexes are treated as false positives, so TINCD achieves a low Specificity. However, predicted protein complexes that do not match reference complexes are not necessarily undesired results; they may well be novel protein complexes [16,12]. Optimizing Specificity and f-measure would therefore discourage the detection of novel complexes. This is the main reason why we do not use these metrics to evaluate the performance of the various methods. On the other hand, as discussed in [16,12], Accuracy and FRAC are more suitable for evaluating the performance of an overlapping protein complex detection algorithm, so we use these two metrics to evaluate the performance of the various methods.
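The matching rule and the counts TP, FN and FP can be illustrated with a short Python sketch. It assumes the conventional definitions used in this literature, which are not confirmed by the text above: Sensitivity = TP / (TP + FN), Specificity = TP / (TP + FP), and f-measure as the harmonic mean of the two. The function names and the toy complexes are hypothetical.

```python
def match_score(x, y):
    """Overlap score |x ∩ y|^2 / (|x| * |y|) between two protein sets."""
    x, y = set(x), set(y)
    if not x or not y:
        return 0.0
    return len(x & y) ** 2 / (len(x) * len(y))

def evaluate(predicted, known, omega=0.2):
    """Count TP, FN, FP as described in the text and derive the metrics.

    The three formulas below are the conventional definitions and are an
    assumption of this sketch.
    """
    tp = sum(any(match_score(p, k) >= omega for k in known) for p in predicted)
    fn = sum(not any(match_score(p, k) >= omega for p in predicted)
             for k in known)
    fp = len(predicted) - tp
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tp / (tp + fp) if tp + fp else 0.0
    total = sensitivity + specificity
    f_measure = 2 * sensitivity * specificity / total if total else 0.0
    return {"TP": tp, "FN": fn, "FP": fp, "Sensitivity": sensitivity,
            "Specificity": specificity, "f-measure": f_measure}

# Toy data (hypothetical protein labels): one predicted complex matches a
# known complex, two predictions have no counterpart in the reference set.
predicted = [{"A", "B", "C"}, {"D", "E"}, {"X", "Y", "Z"}]
known = [{"A", "B", "C", "F"}, {"G", "H"}]
result = evaluate(predicted, known)
```

In this toy case the two unmatched predictions count as false positives and pull Specificity down to 1/3, even though nothing indicates they are wrong; this is the effect of an incomplete reference set discussed above.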

Protein complexes more accurately detected by TINCD
In this section, we present several example protein complexes that are more accurately detected by TINCD. As mentioned in the manuscript, EC-BNMF and InteHC are two integrative methods that consistently outperform the other computational methods, so we only list the results of TINCD, EC-BNMF and InteHC. Figure S4 shows how the COMPASS complex is found by the clustering algorithms we have studied. This complex in CYC2008 involves 8 proteins. TINCD is the only algorithm that correctly covers all the proteins in this complex; the other algorithms make various mistakes. EC-BNMF and InteHC, which are designed to integrate either different clustering results or diverse data sources for protein complex detection, each miss one protein of the COMPASS complex. In Table S2, we list more example protein complexes that are more accurately detected by TINCD.