Improving prediction of heterodimeric protein complexes using combination with pairwise kernel

Ruan, Peiying; Hayashida, Morihiro; Akutsu, Tatsuya; Vert, Jean-Philippe

doi:10.1186/s12859-018-2017-5

Published: 19 February 2018


BMC Bioinformatics volume 19, Article number: 39 (2018)


Abstract

Background

Since many proteins become functional only after they interact with their partner proteins and form protein complexes, it is essential to identify the sets of proteins that form complexes. Therefore, several computational methods have been proposed to predict complexes from the topology and structure of experimental protein-protein interaction (PPI) networks. These methods work well to predict complexes involving at least three proteins, but generally fail at identifying complexes involving only two different proteins, called heterodimeric complexes or heterodimers. There is, however, an urgent need for efficient methods to predict heterodimers, since a large fraction of known protein complexes are heterodimers.

Results

In this paper, we use three promising kernel functions: the Min kernel and two pairwise kernels, the Metric Learning Pairwise Kernel (MLPK) and the Tensor Product Pairwise Kernel (TPPK). We also consider normalized forms of the Min kernel. We then combine the Min kernel, or one of its normalized forms, with one of the pairwise kernels by plugging the former into the latter. We apply kernels based on PPI, domain, phylogenetic profile, and subcellular localization properties to the prediction of heterodimers. We evaluate our method by employing C-Support Vector Classification (C-SVC), carrying out 10-fold cross-validation, and calculating the average F-measures. The results suggest that the combination of the normalized Min kernel and MLPK leads to the best F-measure and improves on the performance of our previous work, which had been the best existing method so far.

Conclusions

We propose new methods to predict heterodimers, using a machine learning-based approach. We train a support vector machine (SVM) to discriminate interacting vs non-interacting protein pairs, based on information extracted from PPI, domain, phylogenetic profiles and subcellular localization. We evaluate in detail new kernel functions to encode these data, and report prediction performance that exceeds that of state-of-the-art methods.

Background

Many proteins carry out their biological functions by interacting with other proteins to form multiprotein structures, called protein complexes [1], which are crucial for a broad range of biological processes. For example, the ribosome is an assembly of protein and RNA subunits responsible for protein translation. Therefore, understanding protein functions, as well as biological processes, requires identification of sets of proteins that form complexes. A significant fraction of known protein complexes are heterodimeric protein complexes (heterodimers), that is, complexes formed by the assembly of two different proteins. For example, the two most important protein complex catalogs, CYC2008 [2] and MIPS [3], include 172 (42%) and 64 (29%) heterodimers, respectively. Hence, it is necessary to develop accurate methods for predicting heterodimers. Here, CYC2008 is a comprehensive catalog of 408 manually curated yeast protein complexes reliably supported by small-scale experiments, and MIPS provides detailed information involving classification schemes for analysis of protein sequences, RNA genes, and other genetic elements [4–6].

Several high-throughput methods, such as tandem affinity purification (TAP) and yeast two-hybrid (Y2H) [9], have supplied us with large datasets of protein-protein interactions (PPIs) [7, 8]. To predict protein complexes, many researchers have proposed to study the structure of the resulting PPI network [10–15], which is an undirected graph with proteins represented as vertices and interactions between them represented as edges. For example, methods such as Markov Cluster (MCL) [16], Molecular Complex Detection (MCODE) [17], Clustering-based on Maximal Cliques (CMC) [18], Protein Complex Prediction (PCP) [19], and CFinder [20] are mainly based on the topological structures of PPI networks. Other methods, such as Restricted Neighborhood Search Clustering (RNSC) [21] and that of Feng et al. [22], exploit biological information such as microarray data and gene ontology (GO) to strengthen the reliability of interactions, so as to rebuild a more reliable PPI network and to predict complexes from it through a subgraph detection method. Some supervised approaches, such as a Bayesian classifier [23], have also been proposed. These methods, however, focus mainly on detecting densely connected subgraphs in PPI networks and are therefore not adapted to the identification of heterodimers. Indeed, for a complex involving only two proteins, the structure of the PPI network restricted to the two proteins involved is reduced to the presence or absence of an edge between them, and the prediction boils down to the experimentally measured interaction. The methods above are not satisfactory because (i) high-throughput experimental measures are known to have high rates of false positives and false negatives, and (ii) two interacting proteins do not necessarily form a heterodimer, as they may instead be involved in a larger complex. As a result, it is difficult to predict heterodimers accurately with these methods, which have been evaluated for their ability to predict protein complexes consisting of at least three proteins.

Another class of methods focuses specifically on the prediction of heterodimers, using either random walks on PPI networks, such as the Repeated Random Walks (RRW) method [24] and the Node-Weighted Expansion (NWE) method [25], or a naive Bayes classifier as proposed by Maruyama [26], with features combining PPI data, GO annotations, and gene expression data. The latter method has been shown to achieve a better F-measure for the prediction of heterodimers than other existing prediction methods, including MCL, MCODE, RRW, and NWE.

To improve the prediction accuracy for heterodimers, Ruan et al. [27] proposed a supervised method with several features based on PPI weights. The weights are obtained from the WI-PHI dataset (a Weighted yeast Interactome enriched for direct PHysical Interactions), which includes 49,607 interacting protein pairs, excluding self-interactions, together with the weights of the interactions between them. The main idea behind the design of the feature space mappings is that the weights of the interactions neighboring a heterodimer tend to be smaller than the weight of the interaction inside the heterodimer. In addition to features based on weights, they proposed feature space mappings based on the number of protein domains, because domains are considered to be functional and structural units in proteins. Furthermore, they designed a Domain Composition kernel based on the idea that two proteins having the same composition of domains as a known heterodimer are likely to form a heterodimer. The method showed considerable promise for heterodimer detection (F-measure=63.1%), significantly outperforming previous works.

Yong et al. [28] proposed a two-stage approach and tested it on the prediction of yeast and human small complexes (consisting of two or three distinct proteins). They compared it with some popular complex prediction methods and generated a larger number of novel predictions. However, for the prediction of yeast heterodimers, they did not report precision and recall. Therefore, we cannot determine from their results whether or not they achieve better performance than Ruan et al. [27].

Note that Yugandhar et al. [29] applied a machine learning approach to classify protein-protein complexes based on their binding affinities. Their method reaches 76.1% accuracy in separating heterodimers into high- and low-affinity groups. However, they classify known heterodimers into different groups and do not predict heterodimers from given protein pairs; hence their purpose is different from ours.

In this paper, our goal is to further improve the prediction accuracy for heterodimers. We investigate combination kernels to encode the domain composition of proteins involved in a complex, since the one used in Ruan et al. [27] was very crude. More precisely, they define the similarity of domain composition in protein pairs very strictly, only considering two protein pairs with exactly the same compositions as an effective feature in the kernel function. We find that there is room to improve prediction on this point by replacing “exactly the same” with “similar”. For that purpose we propose to replace the Dirac kernel (which is 1 if two proteins have exactly the same domain composition, and 0 otherwise) by the so-called Min kernel, which counts the number of shared domains between two proteins. Furthermore, since our problem is formally to classify pairs of proteins as interacting or not, we exploit the notion of pairwise kernels to extend kernels between individual proteins to kernels between pairs of proteins, investigating in particular the metric learning pairwise kernel (MLPK) and the tensor product pairwise kernel (TPPK), as explained in [30] and in the “Methods” section.

In addition, we consider that various sources of information may contribute to an accurate predictor. The combination of various sources can be divided into three situations: (1) various types of features with a single kernel; (2) one type of feature with multiple kernels; (3) various types of features with multiple kernels. We test all three situations and show only the significant results in our computational experiments. Regarding the types of features, besides the protein-protein interaction (PPI) and domain properties, we also try to use the phylogenetic profile property. The reason is that two proteins that are both present or both absent in the same genomes are likely to have related functions. Moreover, the protein subcellular localization property is considered as well. As proteins must be localized in their appropriate subcellular compartment to perform their function, proteins in the same location may have similar functions. Regarding multiple kernels, we employ the Min kernel and its two normalized forms, the MinMax kernel and the Scaled Min kernel, as well as two pairwise kernels, MLPK and TPPK.

Then, we employ C-Support Vector Classification (C-SVC), carry out ten-fold cross-validation and calculate the average precision, recall, and F-measure. The computational experiments show that using the Min kernel improves the prediction performance, and that the combinations of multiple kernels outperform the single Min kernel and are therefore superior to [27] and other existing methods. However, the combinations with the new types of features that we present do not contribute to accuracy improvement. Thus, situation (2) is more appropriate to our problem, though we do not rule out that situation (3) could be effective with other useful types of features.

The rest of the paper is organized as follows: “Methods” section introduces our methods, including details of kernel combination and other types of features. “Results” section presents the performance evaluation and comparison with other methods, as well as a discussion of the results. “Conclusions” section concludes the paper.

Methods

We formulate the problem of heterodimer prediction as a supervised binary classification problem. Given a set of pairs of proteins that are known to form heterodimers (positive examples), and pairs of proteins that do not form heterodimers (negative examples) as training data, we learn a function f(x) to predict if a pair x of proteins in the test set can form a heterodimer (f(x)≥0) or not (f(x)<0). The definitions of positive examples and negative examples are the same as in [27]. To learn the function f(x) from a training set (x₁,y₁),…,(x_n,y_n), where each $x_{i} \in \mathbb {R}^{p}$ is a vector of descriptors for a pair of proteins and y_i∈{−1,1} indicates whether the pair can form a complex or not, we employ a C-support vector classification (C-SVC) classifier, with a balanced loss penalty to compensate for the fact that the numbers of positive examples and negative examples are very unbalanced.

Various properties and multiple kernels

We explain multiple kernels involving properties of PPI, domain, phylogenetic profile, and subcellular localization in this section.

PPI and domain properties

For the PPI and domain properties, we follow the work in [27]. The feature space mapping ψ for a pair of proteins P_i, P_j is defined as

$$ {\begin{aligned} \boldsymbol{\psi}(P_{i}, P_{j})=\left(\begin{array}{c} w_{ij} \\ \max \left\{\max_{\{k|(i,k)\in E, k\neq j\} }w_{ik}, \max_{\{k|(j,k)\in E, k\neq i\}}w_{jk}\right\} \\ \min \left\{\min_{\{k|(i,k)\in E, k\neq j\} }w_{ik}, \min_{\{k|(j,k)\in E, k\neq i\}}w_{jk}\right\} \\ \max_{\{k|(i,k)\in E, (j,k)\in E\}} \min\{w_{ik}, w_{jk}\} \\ \max_{\{k_{1},k_{2}|(i, k_{1})\in E, k_{1} \neq j, (j, k_{2})\in E, k_{2} \neq i\}}\left\vert\, w_{ik_{1}}-w_{jk_{2}} \,\right\vert\\ \max\{\# \text{domains of}~P_{i}, \# \text{domains of}~P_{j}\}\\ \min\{\# \text{domains of}~P_{i}, \# \text{domains of}~P_{j}\} \end{array}\right), \end{aligned}} $$

(1)

where w_ij denotes the weight of the interaction between P_i and P_j. These are novel features proposed by Ruan et al., and the detailed descriptions of each feature can be found in [27].
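As a rough illustration of Eq. (1), the seven features can be computed from a weighted adjacency structure as sketched below. The function name `psi`, the nested-dictionary layout of `weights`, the toy network, and the fallback value 0.0 for an empty neighbor set are all assumptions made for this sketch; the treatment of empty sets in [27] is not stated here.

```python
def psi(i, j, weights, n_domains):
    """Seven-dimensional feature vector of Eq. (1) for the protein pair (i, j)."""
    # Weights of edges around i (excluding j) and around j (excluding i).
    w_i = {k: w for k, w in weights.get(i, {}).items() if k != j}
    w_j = {k: w for k, w in weights.get(j, {}).items() if k != i}
    around = list(w_i.values()) + list(w_j.values())
    # Common neighbors k of i and j contribute min(w_ik, w_jk).
    shared = [min(w_i[k], w_j[k]) for k in w_i if k in w_j]
    # All absolute differences between a weight around i and a weight around j.
    gaps = [abs(a - b) for a in w_i.values() for b in w_j.values()]
    return [
        weights.get(i, {}).get(j, 0.0),
        max(around, default=0.0),  # assumption: 0.0 when no neighbor exists
        min(around, default=0.0),
        max(shared, default=0.0),
        max(gaps, default=0.0),
        max(n_domains[i], n_domains[j]),
        min(n_domains[i], n_domains[j]),
    ]


# Hypothetical toy network (symmetric adjacency) and domain counts.
weights = {
    "A": {"B": 30.0, "C": 10.0},
    "B": {"A": 30.0, "C": 5.0, "D": 20.0},
    "C": {"A": 10.0, "B": 5.0},
    "D": {"B": 20.0},
}
n_domains = {"A": 2, "B": 1, "C": 3, "D": 1}
features = psi("A", "B", weights, n_domains)
print(features)  # [30.0, 20.0, 5.0, 5.0, 10.0, 2, 1]
```

In this toy case the weight inside the pair (30) exceeds every neighboring weight (at most 20), which is the pattern the features are designed to capture.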

Another method involving the domain property was proposed in [27], called the Domain Composition kernel. We briefly review it here, since our approach mainly improves on this part.

Suppose that there are several domains D_j in proteins. We define a feature space mapping ϕ_dom for protein P_i so that the j-th element of ϕ_dom(P_i) is the number of occurrences of domain D_j in P_i. For example, in Fig. 1, the left side is a protein P_i with domains D₁,D₁,D₃,D₄ and the right side is the corresponding feature space mapping ϕ_dom(P_i) with values (2,0,1,1,0,⋯) representing 2 D₁s, 0 D₂, 1 D₃, 1 D₄, 0 D₅, and so on, included in protein P_i. The dimension of ϕ_dom(P_i) is the total number of distinct domains contained in all proteins.
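The mapping ϕ_dom can be sketched in a few lines. The function name `domain_composition` and the five-domain vocabulary are hypothetical; the example protein mirrors the one described for Fig. 1.

```python
from collections import Counter


def domain_composition(domains, vocabulary):
    """Count how often each domain of `vocabulary` occurs in one protein."""
    counts = Counter(domains)
    return [counts[d] for d in vocabulary]


# Hypothetical vocabulary truncated to five domains.
vocabulary = ["D1", "D2", "D3", "D4", "D5"]
phi_dom = domain_composition(["D1", "D1", "D3", "D4"], vocabulary)
print(phi_dom)  # [2, 0, 1, 1, 0]
```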

The formulation of Domain Composition kernel K_C for two pairs of proteins, (P₁,P₂) and (P₃,P₄), is defined as

$$ \begin{aligned} K_{C}((P_{1}, P_{2}), (P_{3}, P_{4})) = \max\{&\delta(\boldsymbol{\phi_{dom}}(P_{1})=\boldsymbol{\phi_{dom}}(P_{3}))\cdot\delta(\boldsymbol{\phi_{dom}}(P_{2})=\boldsymbol{\phi_{dom}}(P_{4})),\\ &\delta(\boldsymbol{\phi_{dom}}(P_{1})=\boldsymbol{\phi_{dom}}(P_{4}))\cdot\delta(\boldsymbol{\phi_{dom}}(P_{2})=\boldsymbol{\phi_{dom}}(P_{3}))\}, \end{aligned} $$

(2)

where δ(S)=1 if S holds, otherwise 0. It should be noted that the Domain Composition kernel is actually defined for pairs of two or more proteins.
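A minimal sketch of Eq. (2) follows, assuming each protein is already represented by its domain composition vector (here a tuple). The function names and the toy vectors are illustrative only.

```python
def delta(condition):
    """Indicator function: 1 if the condition holds, otherwise 0."""
    return 1 if condition else 0


def domain_composition_kernel(pair_a, pair_b):
    """Eq. (2): 1 if the two pairs have identical compositions in either order."""
    (p1, p2), (p3, p4) = pair_a, pair_b
    direct = delta(p1 == p3) * delta(p2 == p4)
    crossed = delta(p1 == p4) * delta(p2 == p3)
    return max(direct, crossed)


# Hypothetical domain composition vectors.
a, b, c = (2, 0, 1), (0, 1, 0), (2, 0, 0)
print(domain_composition_kernel((a, b), (b, a)))  # 1: same compositions, swapped
print(domain_composition_kernel((a, b), (c, b)))  # 0: a and c differ in one count
```

The second call shows how strict the kernel is: a single differing domain count makes the similarity drop to 0.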

In this study, we focus on replacing the Domain Composition kernel with more promising combination kernels. Before presenting the combination kernels, we first introduce the other properties.

Phylogenetic profile property

The phylogenetic profile of a protein is a vector that describes the presence or absence of homologs in organisms. It has been shown that proteins having similar profiles strongly tend to be functionally linked [31], and it is well known that proteins with similar functions are likely to form a complex. Therefore, we consider that phylogenetic profiles may be helpful for determining heterodimers.

To represent the subset of organisms that contain a homolog, we constructed a phylogenetic profile for each protein. This profile is a vector with m entries, where m corresponds to the number of genomes (2,717 in the present article). We indicate the presence of a homolog to a given protein in the j-th genome with an entry of unity at the j-th element. If no homolog is found, the element is zero.

We compute phylogenetic profiles for the 5,497 proteins encoded by the genome from KEGG OC [32], a novel database of ortholog clusters. Each protein P_i is encoded as a profile over 2,717 genomes, which consist of eukaryotes, bacteria and archaea. The j-th genome is defined as including a homolog of a protein P_i if one of the proteins it encodes aligns to P_i with a score that is deemed statistically significant.

In Fig. 2, the left side shows several genomes with their proteins and the right side shows the phylogenetic profiles of all proteins. We define a feature space mapping ϕ_phylo for protein P_i so that the j-th element of ϕ_phylo(P_i) describes whether or not the j-th genome contains a homolog of P_i. For example, P₁ exists in EC and BS but not in SC, so in the phylogenetic profile of protein P₁, the elements for EC and BS are 1, and the element for SC is 0.
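The construction of ϕ_phylo can be sketched as follows. Only the presence of P₁ in EC and BS and its absence from SC comes from the example above; the function name, the dictionary layout, and the entries for P2 are hypothetical.

```python
def phylogenetic_profile(protein, genomes, homologs):
    """Binary vector: 1 if the genome contains a homolog of `protein`, else 0."""
    return [1 if protein in homologs[g] else 0 for g in genomes]


genomes = ["EC", "BS", "SC"]
# Hypothetical homolog sets per genome; only P1's entries follow the text.
homologs = {"EC": {"P1", "P2"}, "BS": {"P1"}, "SC": {"P2"}}
profile_p1 = phylogenetic_profile("P1", genomes, homologs)
print(profile_p1)  # [1, 1, 0]
```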

Subcellular localization property

Determining the subcellular localization of a protein is a key step toward understanding its cellular function, since proteins with the same subcellular localization tend to have similar functions. We obtain the subcellular localization information for each protein, such as cell membrane, cytoplasm, nucleus, and so on, from UniProtKB. As with the phylogenetic profile property, we construct a feature space mapping ϕ_local(P_i) containing the subcellular localization information for each protein P_i. The dimension of the feature space is the number of distinct localizations over all proteins in our experiments, and each element is 1 or 0, representing whether or not the corresponding protein exists in that location (as shown in Fig. 3).

Multiple kernels

In this section, we describe the Min kernel with its normalized forms, and then two pairwise kernels.

The Min kernel [33] sums the element-wise minima of two feature vectors; for binary vectors this is simply the number of common elements, which is a simple way to measure their similarity. Unlike the Domain Composition kernel, which outputs 1 or 0 according to whether or not two protein pairs have exactly the same domain composition, the Min kernel counts the number of common domains in two proteins. When combined with one of the pairwise kernels presented below, the combined Min kernel measures the similarity of domain composition between protein pairs. Note that the Min kernel has been shown to be useful for detection and recognition in [34, 35]. For feature vectors x,y, the Min kernel K_Min is defined by

$$ K_{Min}(\boldsymbol{x, y})=\sum\limits_{i=1}^{n}\min\{x_{i},y_{i}\}, $$

(3)

where x_i denotes i-th element of vector x, n denotes the number of elements of x, and x_i,y_i≥0 for all i.
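As a quick numerical illustration of Eq. (3), the sketch below evaluates the Min kernel on two hypothetical domain composition vectors. The function name and the input checks are choices made for this example.

```python
def min_kernel(x, y):
    """Eq. (3): sum of element-wise minima of two non-negative vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    if any(v < 0 for v in x) or any(v < 0 for v in y):
        raise ValueError("the Min kernel requires non-negative entries")
    return sum(min(a, b) for a, b in zip(x, y))


x = [2, 0, 1, 1, 0]  # hypothetical protein with domains D1, D1, D3, D4
y = [1, 1, 1, 0, 0]  # hypothetical protein with domains D1, D2, D3
print(min_kernel(x, y))  # 2: one shared D1 and one shared D3
```

The two proteins have different compositions, so a Dirac-style comparison would score them 0, whereas the Min kernel still credits the two shared domains.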

In practice, a normalized form of a kernel is often used to improve prediction accuracy. Therefore, we also consider normalized versions of the Min kernel. Scale-normalization is a very common choice. For a kernel K, the scale-normalized kernel is defined as

$$ K_{norm}(\boldsymbol{x}, \boldsymbol{y})=\frac{K(\boldsymbol{x}, \boldsymbol{y})}{\sqrt{K(\boldsymbol{x}, \boldsymbol{x})K(\boldsymbol{y},\boldsymbol{y})}}. $$

(4)
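Eq. (4) can be sketched as a wrapper that turns any base kernel into its scale-normalized version. Returning 0.0 when the denominator vanishes (for instance for an all-zero vector) is an assumption of this sketch, not something stated in the text.

```python
import math


def min_kernel(x, y):
    """Base kernel of Eq. (3)."""
    return sum(min(a, b) for a, b in zip(x, y))


def scale_normalized(kernel):
    """Eq. (4): wrap `kernel` so that K_norm(x, x) = 1 whenever K(x, x) > 0."""
    def normalized(x, y):
        denom = math.sqrt(kernel(x, x) * kernel(y, y))
        # Assumption: similarity is 0.0 if either self-similarity is zero.
        return kernel(x, y) / denom if denom > 0 else 0.0
    return normalized


normalized_min = scale_normalized(min_kernel)
x = [2, 0, 1, 1, 0]
y = [1, 1, 1, 0, 0]
value = normalized_min(x, y)  # 2 / sqrt(4 * 3)
print(round(value, 4))
```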

The Tanimoto kernel has been shown to perform well on pairwise problems in a previous study [36], and it has a simple expression when applied to the Min kernel, which is called the MinMax kernel. The MinMax kernel can therefore be regarded as another normalized form of the Min kernel. It computes the ratio of the intersection to the union of two feature mappings. For feature vectors x,y, the MinMax kernel K_MinMax is defined as

$$ K_{MinMax}(\boldsymbol{x},\boldsymbol{y})=\frac{K_{Min}(\boldsymbol{x},\boldsymbol{y})}{{\sum\nolimits}_{i=1}^{n}\max\{x_{i},y_{i}\}}, $$

(5)

where K_Min is Min kernel.
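A small sketch of Eq. (5) is given below. The convention of returning 0.0 when both vectors are all zeros is an assumption, since Eq. (5) is undefined in that case.

```python
def minmax_kernel(x, y):
    """Eq. (5): sum of element-wise minima divided by sum of element-wise maxima."""
    intersection = sum(min(a, b) for a, b in zip(x, y))
    union = sum(max(a, b) for a, b in zip(x, y))
    # Assumption: two all-zero vectors get similarity 0.0.
    return intersection / union if union > 0 else 0.0


x = [2, 0, 1, 1, 0]
y = [1, 1, 1, 0, 0]
print(minmax_kernel(x, y))  # 0.4, i.e. 2 / 5
```

Like the scale-normalized form, the value lies between 0 and 1 and equals 1 only when the two vectors coincide.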

Next, we briefly review two pairwise kernels, the Metric Learning Pairwise Kernel (MLPK) [30] and Tensor Product Pairwise Kernel (TPPK) [37].

Vert et al. [30] showed that MLPK is a valid kernel for pairs that can easily be used to solve supervised classification problems. For the heterodimer prediction problem, it infers pairwise relationships from hetero-protein pairs by defining a kernel between pairs of proteins from a kernel between individual proteins. The MLPK kernel K_MLPK between pairs (x₁,x₂) and (x₃,x₄) is defined as

$$ \begin{aligned} K_{MLPK}((\boldsymbol{x}_{1}, \boldsymbol{x}_{2}),(\boldsymbol{x}_{3}, \boldsymbol{x}_{4})) &= (K(\boldsymbol{x}_{1}, \boldsymbol{x}_{3})-K(\boldsymbol{x}_{1}, \boldsymbol{x}_{4})\\& \quad-K(\boldsymbol{x}_{2}, \boldsymbol{x}_{3})+K(\boldsymbol{x}_{2}, \boldsymbol{x}_{4}))^{2}, \end{aligned} $$

(6)

The rationale behind MLPK is that a pair (x₁,x₂) is compared with another pair (x₃,x₄) by contrasting the direct matching K(x₁,x₃)+K(x₂,x₄) with the crossed matching K(x₁,x₄)+K(x₂,x₃). In other words, MLPK compares pairs through the differences between their elements in the feature space.
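The sketch below evaluates Eq. (6) with the Min kernel as base kernel and also checks, for a linear base kernel, the identity K_MLPK = ⟨x₁−x₂, x₃−x₄⟩², which follows from expanding the inner product. The vectors and function names are hypothetical.

```python
def min_kernel(x, y):
    return sum(min(a, b) for a, b in zip(x, y))


def linear_kernel(x, y):
    return sum(a * b for a, b in zip(x, y))


def mlpk(pair_a, pair_b, base):
    """Eq. (6): squared difference between direct and crossed matchings."""
    (x1, x2), (x3, x4) = pair_a, pair_b
    return (base(x1, x3) - base(x1, x4) - base(x2, x3) + base(x2, x4)) ** 2


x1, x2 = [2, 0, 1], [0, 1, 0]
x3, x4 = [1, 0, 1], [0, 2, 0]
value = mlpk((x1, x2), (x3, x4), min_kernel)
swapped = mlpk((x2, x1), (x3, x4), min_kernel)
print(value, swapped)  # 9 9: the order inside a pair does not matter

# With a linear base kernel, MLPK equals the squared inner product of differences.
d12 = [a - b for a, b in zip(x1, x2)]
d34 = [a - b for a, b in zip(x3, x4)]
print(mlpk((x1, x2), (x3, x4), linear_kernel), linear_kernel(d12, d34) ** 2)  # 25 25
```

Because the expression is squared, swapping the two proteins inside a pair only flips the sign of the inner term, so the kernel is well defined on unordered pairs.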

Unlike MLPK, the TPPK kernel compares pairs by comparing x₁ with x₃ and x₂ with x₄ on the one hand, and x₁ with x₄ and x₂ with x₃ on the other. Both comparisons are obtained by a tensorization of the initial feature space. Therefore, this pairwise kernel is called the tensor product pairwise kernel. The TPPK kernel is defined as

$$ \begin{aligned} K_{TPPK}((\boldsymbol{x}_{1}, \boldsymbol{x}_{2}),(\boldsymbol{x}_{3}, \boldsymbol{x}_{4})) &= K(\boldsymbol{x}_{1}, \boldsymbol{x}_{3})K(\boldsymbol{x}_{2}, \boldsymbol{x}_{4})\\& \quad+ K(\boldsymbol{x}_{1}, \boldsymbol{x}_{4})K(\boldsymbol{x}_{2}, \boldsymbol{x}_{3}), \end{aligned} $$

(7)
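For comparison, Eq. (7) can be sketched in the same style, again with the Min kernel as a hypothetical choice of base kernel and made-up vectors.

```python
def min_kernel(x, y):
    return sum(min(a, b) for a, b in zip(x, y))


def tppk(pair_a, pair_b, base):
    """Eq. (7): product of direct matchings plus product of crossed matchings."""
    (x1, x2), (x3, x4) = pair_a, pair_b
    return base(x1, x3) * base(x2, x4) + base(x1, x4) * base(x2, x3)


x1, x2 = [2, 0, 1], [0, 1, 0]
x3, x4 = [1, 0, 1], [0, 2, 0]
value = tppk((x1, x2), (x3, x4), min_kernel)
swapped = tppk((x1, x2), (x4, x3), min_kernel)
print(value, swapped)  # 2 2: symmetric in the order inside a pair
```

Here the pair scores highly only if each protein of one pair resembles a distinct protein of the other pair, which is a different notion of similarity from the difference-based comparison of MLPK.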

Kernel combinations

So far, we have introduced three kernels between proteins, namely the Min kernel and its two normalized versions, the MinMax kernel and the scaled kernel (called the Normalized kernel in the results), as well as two pairwise kernels between protein pairs, the MLPK kernel and the TPPK kernel. We therefore consider all possible combinations (3×2=6) of these kernels.

For two protein pairs (P₁, P₂) and (P₃, P₄), we have the following combinations.

$$ {\begin{aligned} K_{M}((P_{1},P_{2}),(P_{3},P_{4})) &= (K(\boldsymbol{\phi}(P_{1}),\boldsymbol{\phi}(P_{3})) - K(\boldsymbol{\phi}(P_{1}),\boldsymbol{\phi}(P_{4}))\\ &\quad- K(\boldsymbol{\phi}(P_{2}),\boldsymbol{\phi}(P_{3})) \,+\, K(\boldsymbol{\phi}(P_{2}),\boldsymbol{\phi}(P_{4})))^{2}, \end{aligned}} $$

(8)

$$ {\begin{aligned} K_{T}((P_{1},P_{2}),(P_{3},P_{4})) &= K(\boldsymbol{\phi}(P_{1}), \boldsymbol{\phi}(P_{3})) K(\boldsymbol{\phi}(P_{2}),\boldsymbol{\phi}(P_{4}))\\ &\quad+ K(\boldsymbol{\phi}(P_{2}),\boldsymbol{\phi}(P_{3}))K(\boldsymbol{\phi}(P_{1}), \boldsymbol{\phi}(P_{4})), \end{aligned}} $$

(9)

where K(ϕ(P_i),ϕ(P_j)) denotes Min kernel or one of its normalized versions in the two equations. That is to say, we plug Min kernel and its normalized versions into Eqs. (8) and (9), respectively. Note that ϕ(P_i) can be any one of ϕ_dom(P_i), ϕ_phylo(P_i) and ϕ_local(P_i).

Then we combine the feature space mapping ψ (Eq. (1)) with the 6 combinations above, so we have

$$\begin{array}{@{}rcl@{}} {} K_{comb}((P_{1},P_{2}),(P_{3},P_{4}))&=&\langle\boldsymbol{\psi}(P_{1}, P_{2}), \boldsymbol{\psi}(P_{3}, P_{4})\rangle \\ &&+ \alpha K((P_{1},P_{2}),(P_{3},P_{4})), \end{array} $$

(10)

where α is a constant, and K is any one of the six combination kernels. We call K_comb using K_Min “Min-MLPK kernel”, using K_MinMax “MinMax-MLPK kernel”, and using K_norm “Normalized Min-MLPK kernel”. Similarly, when applying the TPPK kernel, we just replace “MLPK” with “TPPK” in these names.
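To make Eq. (10) concrete, the following sketch builds the Gram matrix of K_comb for the Min-MLPK case. Each example is assumed to be stored as a tuple of its ψ vector and the two per-protein feature vectors; this layout, the two-dimensional ψ vectors, and all values are hypothetical.

```python
def min_kernel(x, y):
    return sum(min(a, b) for a, b in zip(x, y))


def mlpk(pair_a, pair_b, base):
    (x1, x2), (x3, x4) = pair_a, pair_b
    return (base(x1, x3) - base(x1, x4) - base(x2, x3) + base(x2, x4)) ** 2


def combined_kernel(example_a, example_b, alpha, base=min_kernel):
    """Eq. (10): inner product of psi vectors plus alpha times a pairwise kernel."""
    psi_a, pair_a = example_a
    psi_b, pair_b = example_b
    linear = sum(u * v for u, v in zip(psi_a, psi_b))
    return linear + alpha * mlpk(pair_a, pair_b, base)


def gram_matrix(examples, alpha):
    return [[combined_kernel(a, b, alpha) for b in examples] for a in examples]


# Hypothetical examples: (psi vector, (phi of first protein, phi of second protein)).
examples = [
    ([1.0, 0.0], ([2, 0, 1], [0, 1, 0])),
    ([0.5, 1.0], ([1, 0, 1], [0, 2, 0])),
]
gram = gram_matrix(examples, alpha=0.5)
print(gram)  # [[9.0, 5.0], [5.0, 9.25]]
```

A matrix of this form is what a kernel classifier that accepts precomputed kernels would take as input; setting alpha to 0.0 reduces it to the linear kernel on ψ alone.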

The study [30] pointed out that combining MLPK and TPPK by summation almost always leads to the best results. Therefore, by summing the MLPK combination (Eq. (8)) and the TPPK combination (Eq. (9)), we have

$$\begin{array}{@{}rcl@{}} {} K_{comb}((P_{1},P_{2}),(P_{3},P_{4}))&=&\langle\boldsymbol{\psi}(P_{1}, P_{2}), \boldsymbol{\psi}(P_{3}, P_{4})\rangle \\ &&+ \alpha K_{M}((P_{1},P_{2}),(P_{3},P_{4})) \\ &&+ \alpha K_{T}((P_{1},P_{2}),(P_{3},P_{4})),\,\,\, \end{array} $$

(11)

We call K_comb using K_M and K_T “MinMax-MLPK-TPPK kernel”.

C-Support Vector Classification (C-SVC)

We use the C-Support Vector Classification (C-SVC) [38, 39] formulation that infers a function f(x)=w^⊤x that best separates positive examples from negative ones by solving the optimization problem:

$$ \begin{aligned} & \text{minimize} & & \frac{1}{2} \|\boldsymbol{w}\|^{2}+C^{+}\sum\limits_{y_{i}=+1}{\xi_{i}}+C^{-}\sum\limits_{y_{i}=-1}{\xi_{i}}\\ & \text{subject to} & & y_{i}\left(\boldsymbol{w}^{\top} \boldsymbol{x_{i}} +b\right) \geq 1 - \xi_{i}, \text{for all}\ {i}\\ & & & \xi_{i} \geq 0, \text{for all} {i} \end{aligned} $$

(12)

where C⁺ and C⁻ are regularization parameters for positive and negative examples, respectively. Instead of representing explicitly each pair of proteins by a vector of descriptors x∈R^p, we will use positive definite kernels K(x,x^′) in which case the C-SVC classifier takes the form $f(\boldsymbol {x}) = {\sum \nolimits }_{i=1}^{n} \alpha _{i} K(\boldsymbol {x_{i}},\boldsymbol {x})$ where the vector α∈Rⁿ is the solution of the dual problem:

$$ \begin{aligned} & \text{minimize} & & \boldsymbol{\alpha}^{\top} \mathbf{K} \boldsymbol{\alpha} - 2 \boldsymbol{\alpha}^{\top} \boldsymbol{y}\\ & \text{subject to} & & 0 \leq \alpha_{i} \leq C^{+}, \text{if }y_{i}=1\\ & & & 0 \leq -\alpha_{i} \leq C^{-}, \text{if }y_{i}=-1 \end{aligned} $$

(13)

where K is the n×n Gram matrix with entries K_ij=K(x_i,x_j) and y is the n-dimensional vector of labels y_i. For the implementation of C-SVC, we used libsvm (version 3.11) [40].

Results

Experiments

In order to compare our proposed method with the method in [27], we used the same dataset, WI-PHI. The weights of interactions had been calculated as follows. (1) High-throughput yeast two-hybrid data by Ito [8] and Uetz [7], as well as several databases such as BioGRID [11], MINT [12] and BIND [13], were used to build the literature-curated physical interaction (LCPH) dataset. (2) A benchmark dataset was constructed to evaluate high-throughput data; its interactions were obtained by two independent methods from LCPH-LS, the low-throughput subset of LCPH. (3) A log-likelihood score (LLS) was calculated for each dataset except LCPH-LS. (4) The weight of each interaction was computed by multiplying the socio-affinity (SA) indices [1] and the LLSs from the different datasets. Note that the SA index is the log-odds score of the number of times two proteins were observed to interact with each other relative to the expected value in the dataset.

We also prepared the same dataset from CYC2008 [2] for training and testing as in the previous study. CYC2008 is a set of 408 manually curated yeast complexes. Compared with the MIPS catalogue, which consists of 215 heteromeric complexes, we believe that CYC2008 represents a more complete and up-to-date description of the stable yeast interactome, and should hence serve as an improved gold standard for the prediction of complexes. The CYC2008 catalogue can be downloaded at: http://wodaklab.org/cyc2008/.

We defined a positive example as a pair of proteins that is included in WI-PHI and is also a heterodimer in CYC2008. A negative example was defined as a pair of proteins that is included in WI-PHI, is not a heterodimer, and is a subset of some other complex in CYC2008. As a result, we had 152 positive examples and 5345 negative examples.
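The labeling rule can be sketched as follows, with small hypothetical stand-ins for the WI-PHI pairs and the CYC2008 complexes. The function name and the data are illustrative; pairs that fall into neither class are simply left out.

```python
def label_pairs(ppi_pairs, complexes):
    """Split interacting pairs into positive and negative examples."""
    complexes = [frozenset(c) for c in complexes]
    heterodimers = {c for c in complexes if len(c) == 2}
    positives, negatives = [], []
    for a, b in ppi_pairs:
        pair = frozenset((a, b))
        if len(pair) < 2:
            continue  # self-interactions are not considered
        if pair in heterodimers:
            positives.append((a, b))
        elif any(pair < c for c in complexes):  # proper subset of a larger complex
            negatives.append((a, b))
    return positives, negatives


# Hypothetical interaction list and complex catalogue.
ppi_pairs = [("A", "B"), ("C", "D"), ("D", "E"), ("A", "F")]
complexes = [{"A", "B"}, {"C", "D", "E"}]
positives, negatives = label_pairs(ppi_pairs, complexes)
print(positives)  # [('A', 'B')]
print(negatives)  # [('C', 'D'), ('D', 'E')]
```

The pair ("A", "F") is neither a heterodimer nor part of a known complex, so it receives no label in this sketch.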

Performance measure

We chose the following three measures to evaluate our performance. Precision describes the rate of correctly predicted positive examples to all positively predicted examples, and recall describes the rate of correctly predicted positive examples to all positive examples. Both of them indicate the effectiveness of the method from different aspects. F-measure is defined as their harmonic mean, which was used for evaluating the balance of precision and recall since it is insufficient to evaluate by any single one of them.

They are defined as

$$ {\kern6pt}\text{precision} = \frac{TP}{TP+FP}, $$

(14)

$$ {\kern21pt}\text{recall} = \frac{TP}{TP+FN}, $$

(15)

$$ \text{F-measure} = \frac{2\cdot \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}, $$

(16)

where TP,FP, and FN represent the numbers of correctly predicted positive examples, incorrectly predicted positive examples, and incorrectly predicted negative examples, respectively.
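Eqs. (14) to (16) translate directly into code. The counts in the example are made up, and returning 0.0 when a denominator is zero is a convention assumed for this sketch.

```python
def precision_recall_f(tp, fp, fn):
    """Return (precision, recall, F-measure) as in Eqs. (14)-(16)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Hypothetical counts: 60 true positives, 40 false positives, 20 false negatives.
precision, recall, f_measure = precision_recall_f(tp=60, fp=40, fn=20)
print(round(precision, 3), round(recall, 3), round(f_measure, 3))
```

With these counts the precision is 0.6 and the recall is 0.75, and the F-measure (about 0.667) lies between the two, closer to the smaller value, as expected for a harmonic mean.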

Results

We present below a comparison of our proposed combination kernels and the best existing method [27], which is labeled “Domain Composition kernel” in Figs. 4, 5 and 6. Note that the features shown in Eq. (1) were used both in this study and in [27].

In [27], C-SVC was employed with the mixing parameter α = 0.0, 0.1, 0.2, ..., 2.0 and the regularization parameters C⁻ = 0.1, 0.2, ⋯, 2.0 and C⁺ = 3.0, 3.5, ⋯, 6.0. The best result was obtained for α = 0.5, C⁻ = 1.0 and C⁺ = 4.0. In those experiments, the results hardly changed as C⁻ varied. Therefore, in this study, we fixed C⁻ at 1.0, and set the other parameters around the best values: α = 0.0, 0.1, 0.2, ..., 1.0, and C⁺ = 3.5, 4.0, 4.5. By performing 10-fold cross-validation for each setting and taking the average of precision, recall, and F-measure, we used the same experimental procedure to compare the performance of the Min kernel and its normalized forms, as well as the kernels combining them with MLPK, TPPK and their summation MLPK + TPPK. The results are shown in Figs. 4, 5 and 6.
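The search over settings described above can be sketched as a parameter grid plus a fold assignment. The grid values are those given in the text; the stratified, shuffled fold assignment and the fixed seed are assumptions of this sketch, since the text does not say how the folds were drawn.

```python
import random


def parameter_grid():
    """All (alpha, C_plus, C_minus) settings explored in this study."""
    alphas = [round(0.1 * k, 1) for k in range(11)]  # 0.0, 0.1, ..., 1.0
    c_plus_values = [3.5, 4.0, 4.5]
    return [(alpha, c_plus, 1.0) for c_plus in c_plus_values for alpha in alphas]


def stratified_folds(labels, n_folds=10, seed=0):
    """Assumed scheme: shuffle each class and deal its indices round-robin."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for cls in sorted(set(labels)):
        indices = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds


grid = parameter_grid()
labels = [1] * 152 + [-1] * 5345  # class sizes reported for the dataset
folds = stratified_folds(labels)
print(len(grid))  # 33 settings
print([sum(1 for i in fold if labels[i] == 1) for fold in folds])
```

Each fold then holds 15 or 16 positive examples, which keeps the strong class imbalance comparable across folds.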

When C⁺ = 4.5 and α = 0.3, Normalized Min-MLPK kernel attains the best F-measure 0.686, compared with 0.631 in [27]. Figure 5 shows the results of each combination kernel on the average F-measures for the case when C⁺ = 4.0, C⁻ = 1.0 and α = 0.0, 0.1, 0.2,..., 1.0. The best two results 0.678 and 0.675 were obtained by the Normalized Min-MLPK kernel when α = 0.4 and 0.3, respectively. The third best result 0.673 was obtained by MinMax-MLPK kernel when α = 0.4.

These three figures indicate that all the MLPK-combined kernels outperform the previously proposed Domain Composition kernel for every value of α, as well as the Min kernel and its normalized forms used alone, while the TPPK-combined kernels perform similarly to the Domain Composition kernel, and even slightly worse at some points. This demonstrates that plugging into the MLPK pairwise kernel indeed leads to better prediction performance.

Discussion

The better performance of MLPK compared to TPPK implies that protein pairs in the training set are similar to other pairs, while the two proteins within a pair are not necessarily similar to each other. This observation is not surprising, because the domain compositions of given protein pairs and of known heterodimers (protein pairs) are expected to be similar, while the two proteins of a pair do not have to be similar to each other. That is also the reason why Ruan et al. proposed the Domain Composition kernel in [27]. It also confirms that the pairwise kernels deduced from the addition of the individual kernels perform better than the addition of the pairwise kernels deduced from individual kernels. Another interesting observation is that, although Vert et al. [30] showed that the summation of MLPK and TPPK almost always led to the best results, for our problem the performance of the combination of MLPK and TPPK almost always lies between those of MLPK and TPPK.

We also show the results for the subcellular localization property and the phylogenetic profile property in Figs. 7, 8, 9. The results for localization remain unchanged and low as α changes, which suggests that, unfortunately, the localization property does not contribute to predicting heterodimers. This is surprising at first, since two proteins can form a complex only if they are co-localized. However, the localization data are incomplete, because not all yeast proteins are assigned a localization, and many proteins are assigned to multiple locations. As a result, the information turns out not to be useful, because only a small fraction of protein pairs share exactly the same localization.

The phylogenetic profile property performs better than [27] at many points when the MinMax-MLPK kernel is applied to it, but worse when the Min-MLPK kernel is applied. In addition, we observed that the Normalized Min-MLPK kernel and the MinMax-MLPK kernel performed better in most cases. This observation shows that normalization contributes to improving prediction accuracy.

Table 1 shows the best average precision, recall, and F-measure of each combination kernel. The Normalized Min-MLPK kernel had the best precision (increased from 61.8 to 71.7%) and the MinMax kernel had the best recall (increased from 64.4 to 71.8%). The Normalized Min-MLPK kernel achieved the best F-measure (increased from 63.1 to 68.6%), and all the proposed methods except the TPPK-combined kernels outperform the Domain Composition kernel. The last five rows are the results of other existing state-of-the-art methods, which were all given the same WI-PHI dataset as ours and executed with their default settings to predict heterodimers, except that the minimum size of predicted complexes was set to two.
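To help read these numbers, the sketch below computes the F-measure from precision and recall, assuming the usual definition as their harmonic mean (the performance measure is defined in an earlier section, so this form is an assumption here). The function name `f_measure` is illustrative.

```python
def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; returns 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Baseline precision and recall quoted in the text (61.8% and 64.4%).
    print(round(100 * f_measure(0.618, 0.644), 1))
```

Plugging in the baseline precision and recall gives a value close to the quoted baseline F-measure of 63.1%. Because the table reports each measure at its own best setting, the same check need not hold for every row.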

Table 1 Performance on the best average precision, recall, and F-measure for each combination kernel and other methods


Conclusions

We applied multiple combination kernels based on various types of information, such as protein-protein interaction, domain, subcellular localization, and phylogenetic profile, to the prediction of heterodimers. We combined the Min kernel (or one of its normalized forms) computed on this information with a pairwise kernel (MLPK or TPPK) by plugging. To evaluate our proposed method, we performed ten-fold cross-validation experiments for the combination kernels. The results suggest that our proposed method improved on the performance of our previous work, which had been the best existing method so far. In particular, the Normalized Min-MLPK kernel has the best performance.
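The plugging step summarized above can be sketched as follows. This is a minimal illustration under stated assumptions: the Min kernel is taken as the sum of element-wise minima of non-negative feature vectors, the normalized form divides by the geometric mean of the self-similarities, and the MinMax form divides by the sum of element-wise maxima. These definitions, the function names, and the toy vectors are assumptions made for illustration and may differ from the exact formulas in the Methods section.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
Pair = Tuple[Vector, Vector]
Kernel = Callable[[Vector, Vector], float]


def min_kernel(x: Vector, y: Vector) -> float:
    """Sum of element-wise minima (assumes non-negative features)."""
    return float(sum(min(a, b) for a, b in zip(x, y)))


def normalized_min_kernel(x: Vector, y: Vector) -> float:
    """Min kernel scaled so that every protein has self-similarity 1."""
    denom = math.sqrt(min_kernel(x, x) * min_kernel(y, y))
    return min_kernel(x, y) / denom if denom > 0 else 0.0


def minmax_kernel(x: Vector, y: Vector) -> float:
    """Ratio of summed minima to summed maxima."""
    denom = float(sum(max(a, b) for a, b in zip(x, y)))
    return min_kernel(x, y) / denom if denom > 0 else 0.0


def mlpk(k: Kernel, p: Pair, q: Pair) -> float:
    """Plug a protein-level kernel k into the MLPK form of Vert et al. [30]."""
    (a, b), (c, d) = p, q
    return (k(a, c) - k(a, d) - k(b, c) + k(b, d)) ** 2


def gram_matrix(pairs: List[Pair], k: Kernel) -> List[List[float]]:
    """Pair-by-pair kernel matrix, suitable as a precomputed SVM kernel."""
    return [[mlpk(k, p, q) for q in pairs] for p in pairs]


if __name__ == "__main__":
    # Hypothetical feature vectors for three candidate protein pairs.
    pairs = [
        ((1.0, 2.0, 0.0), (2.0, 1.0, 1.0)),
        ((0.0, 1.0, 1.0), (1.0, 0.0, 2.0)),
        ((3.0, 0.0, 1.0), (0.0, 2.0, 1.0)),
    ]
    for row in gram_matrix(pairs, normalized_min_kernel):
        print([round(v, 3) for v in row])
```

The resulting matrix could then be passed to an SVM implementation that accepts precomputed kernels and evaluated by cross-validation; that step is omitted here to keep the sketch free of external dependencies.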

We showed that, for the problem of predicting heterodimeric protein complexes, multiple combination kernels perform better than a single kernel, and that MLPK-combined kernels nearly always achieve better prediction performance than TPPK-combined kernels. In addition, our results suggest that PPI and domain information is more meaningful and promising than subcellular localization and phylogenetic profile information for this problem. Furthermore, we conclude that subcellular localization information has almost no influence on the prediction of heterodimers.

An interesting direction for future research is to design a new kernel based on the neighboring topological structure and weight-labeled edge information, or to extract useful sequence information of protein complexes by deep learning to solve this problem.

Abbreviations

CMC: Clustering-based on maximal cliques
C-SVC: C-Support vector classification
GO: Gene ontology
LCPH: Literature-curated physical interaction
LLS: Log-likelihood score
MCL: Markov CLuster
MCODE: Molecular complex detection
MLPK: Metric learning pairwise kernel
NWE: Node-weighted expansion
PCP: Protein complex prediction
PPI: Protein-protein interaction
RNSC: Restricted neighborhood search clustering
RRW: Repeated random walks
SVM: Support vector machine
TAP: Tandem affinity purification
TPPK: Tensor product pairwise kernel
WI-PHI: Weighted yeast interactome enriched for direct PHysical interactions

References

1. Gavin AC, Aloy P, Grandi P, Krause R, Boesche M. Proteome survey reveals modularity of the yeast cell machinery. Nature. 2006; 440:631–6.
2. Pu S, Wong J, Turner B, Cho E, Wodak S. Up-to-date catalogues of yeast protein complexes. Nucleic Acids Res. 2009; 37(3):825–31.
3. Mewes HW, Amid C, Arnold R, Frishman D, Guldener U. MIPS: analysis and annotation of proteins from whole genomes. Nucleic Acids Res. 2004; 34(Database issue):D169–72.
4. Ho Y, Gruhler A, Heilbut A, Bader GD, Moore L. Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature. 2002; 415(6868):180–3.
5. Gavin AC, Aloy P, Grandi P, Krause R, Boesche M. Proteome survey reveals modularity of the yeast cell machinery. Nature. 2006; 440(7084):631–6.
6. Krogan NJ, Cagney G, Yu H, Zhong G, Guo X. Global landscape of protein complexes in the yeast Saccharomyces cerevisiae. Nature. 2006; 440(7084):637–43.
7. Uetz P, Giot L, Cagney G, Mansfield T, Judson R. A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature. 2000; 403(6770):623–7.
8. Ito T, Chiba T, Ozawa R, Yoshida M, Hattori M. A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc Natl Acad Sci USA. 2001; 98(8):4569–74.
9. Bartel PL, Fields S. The yeast two-hybrid system. New York: Oxford University Press; 1997.
10. Kiemer L, Costa S, Ueffing M, Cesareni G. WI-PHI: A weighted yeast interactome enriched for direct physical interactions. Proteomics. 2007; 7(6):932–43.
11. Stark C, Breitkreutz B, Reguly T, Boucher L, Breitkreutz A. BioGRID: a general repository for interaction datasets. Nucleic Acids Res. 2006; 34(Database issue):D535–9.
12. Zanzoni A, Montecchi-Palazzi L, Quondam M, Ausiello G, Helmer-Citterich M. MINT: a Molecular INTeraction database. FEBS Lett. 2002; 513(1):135–40.
13. Alfarano C, Andrade C, Anthony K, Bahroos N, Bajec M. The biomolecular interaction network database and related tools 2005 update. Nucleic Acids Res. 2005; 33(Database issue):D418–24.
14. Sapkota A, Liu X, Zhao XM, Cao Y, Liu J. DIPOS: database of interacting proteins in Oryza sativa. Mol BioSyst. 2011; 7(9):2615–21.
15. Zhao XM, Zhang XW, Tang WH, Chen L. FPPI: Fusarium graminearum protein-protein interaction database. J Proteome Res. 2009; 8(10):4714–21.
16. Enright A, Dongen SV, Ouzounis C. An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res. 2002; 30(7):1575–84.
17. Bader GD, Hogue CW. An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics. 2003; 4:2.
18. Liu G, Wong L, Chua HN. Complex discovery from weighted PPI networks. Bioinformatics. 2009; 25(15):1891–7.
19. Chua H, Ning K, Sung WK, Leong H, Wong L. Using indirect protein-protein interactions for protein complex prediction. J Bioinforma Comput Biol. 2008; 6(3):435–66.
20. Adamcsek B, Palla G, Farkas IJ, Derenyi I, Vicsek T. CFinder: Locating cliques and overlapping modules in biological networks. Bioinformatics. 2006; 22(8):1021–3.
21. King AD, Pržulj N, Jurisica I. Protein complex prediction via cost-based clustering. Bioinformatics. 2004; 20(17):3013–20.
22. Feng J, Jiang R, Jiang T. A Max-Flow-Based approach to the identification of protein complexes Using protein interaction and microarray data. IEEE/ACM Trans Comput Biol Bioinforma. 2011; 8(3):621–34.
23. Qi Y, Balem F, Faloutsos C, Klein-Seetharaman J, Bar-Joseph Z. Protein complex identification by supervised graph local clustering. Bioinformatics. 2008; 24(13):i250–8.
24. Macropol K, Can T, Singh A. RRW: repeated random walks on genome-scale protein networks for local cluster discovery. BMC Bioinformatics. 2009; 10:283.
25. Maruyama O, Chihara A. NWE: Node-weighted expansion for protein complex prediction using random walk distances. Proteome Sci. 2011; 9(Suppl 1):S14.
26. Maruyama O. Heterodimeric protein complex identification. In: ACM Conference on Bioinformatics, Computational Biology and Biomedicine. New York: ACM: 2011. p. 499–501.
27. Ruan P, Hayashida M, Maruyama O, Akutsu T. Prediction of heterodimeric protein complexes from weighted protein-protein interaction networks using novel features and kernel functions. PLoS ONE. 2013; 8(6):e65265.
28. Yong CH, Maruyama O, Wong L. Discovery of small protein complexes from PPI networks with size-specific supervised weighting. BMC Syst Biol. 2014; 8(Suppl 5):S3.
29. Yugandhar K, Michael Gromiha M. Feature selection and classification of protein–protein complexes based on their binding affinities using machine learning approaches. Proteins. 2014; 82(9):2088–96.
30. Vert JP, Qiu J, Noble WS. A new pairwise kernel for biological network inference with support vector machines. BMC Bioinformatics. 2007; 8(Suppl 10):S8.
31. Pellegrini M, Marcotte EM, Thompson MJ, Eisenberg D, Yeates TO. Assigning protein functions by comparative genome analysis: Protein phylogenetic profiles. Proc Natl Acad Sci. 1999; 96(8):4285–8.
32. Nakaya A, Katayama T, Itoh M, Hiranuka K, Kawashima S, Moriya Y, Okuda S, Tanaka M, Tokimatsu T, Yamanishi Y, Yoshizawa AC, Kanehisa M, Goto S. KEGG OC: a large-scale automatic construction of taxonomy-based ortholog clusters. Nucleic Acids Res. 2013; 41:D353–7.
33. Maji S, Berg AC, Malik J. Classification using intersection kernel support vector machines is efficient. In: 2008 IEEE Conference on Computer Vision and Pattern Recognition, Anchorage, AK, p. 1–8. http://ieeexplore.ieee.org/document/4587630/.
34. Grauman K, Darrell T. The pyramid match kernel: Discriminative classification with sets of image features. Proc Tenth IEEE Int Conf Comput Vis. 2005; 2:1458–65.
35. Lazebnik S, Schmid C, Ponce J. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. Proc 2006 IEEE Comput Soc Conf Comput Vis Pattern Recog. 2006; 2:2169–78.
36. Swamidass S, Chen J, Bruand J, Phung P, Ralaivola L, Baldi P. Kernels for small molecules and the prediction of mutagenicity, toxicity and anti-cancer activity. Bioinformatics. 2005; 21(Suppl 1):i359–68.
37. Ben-Hur A, Noble WS. Kernel methods for predicting protein-protein interactions. Bioinformatics. 2005; 21(Suppl 1):i38–46.
38. Osuna E, Freund R, Girosi F. Support vector machines: Training and applications. Technical Report. 1997.
39. Vapnik V. Statistical Learning Theory. New-York: Wiley-Interscience; 1998.
40. Chang C, Lin C. A library for support vector machines. ACM Trans Intell Syst Technol. 2011; 2(3):27:1–27:27. http://doi.acm.org/10.1145/1961189.1961199.

Acknowledgements

We thank all reviewers for their time and effort. We also thank the “International Research and Training Program of Bioinformatics and Systems Biology” of the JSPS International Training Program (ITP) for its support.

Funding

The work was partially supported by Grants-in-Aid #16K00392 and #26240034 from JSPS, Japan, and the European Research Council (grant ERC-SMAC-280032). The publication costs were funded by JSPS KAKENHI Grant #26240034.

Availability of data and materials

All datasets used in this work are publicly available, and the source references are given in the main manuscript.

About this supplement

This article has been published as part of BMC Bioinformatics Volume 19 Supplement 1, 2018: Proceedings of the 28th International Conference on Genome Informatics: bioinformatics. The full contents of the supplement are available online at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume-19-supplement-1.

Author information

Authors and Affiliations

Artificial Intelligence Research Center, National Institute of Advanced Industrial Science and Technology (AIST), Tokyo, Japan
Peiying Ruan
Department of Electrical Engineering and Computer Science, National Institute of Technology, Matsue College, 14-4, Nishiikumacho, Matsue, 690-8518, Japan
Morihiro Hayashida
Bioinformatics Center, Institute for Chemical Research, Kyoto University, Gokasho, Uji, Kyoto, 6110011, Japan
Tatsuya Akutsu
MINES ParisTech, PSL Research University, CBIO-Centre for Computational Biology, Paris, 75006, France
Jean-Philippe Vert
Institut Curie, Paris, 75005, France
Jean-Philippe Vert
INSERM U900, Paris, 75005, France
Jean-Philippe Vert
Ecole Normale Supérieure, Department of Mathematics and Applications, Paris, 75005, France
Jean-Philippe Vert

Contributions

PR, TA, and JPV contributed to the concept and design of the study. PR implemented the method, carried out the experiments, and drafted the manuscript. MH gave technical support and valuable advice. All of the authors have read and approved the final manuscript.

Corresponding author

Correspondence to Jean-Philippe Vert.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

About this article

Cite this article

Ruan, P., Hayashida, M., Akutsu, T. et al. Improving prediction of heterodimeric protein complexes using combination with pairwise kernel. BMC Bioinformatics 19 (Suppl 1), 39 (2018). https://doi.org/10.1186/s12859-018-2017-5

Published: 19 February 2018
DOI: https://doi.org/10.1186/s12859-018-2017-5
