
Semi-supervised prediction of protein interaction sites from unlabeled sample information

Abstract

Background

The recognition of protein interaction sites is of great significance for many biological processes, signaling pathways and drug design. However, most sites on protein sequences cannot be labeled as interface or non-interface sites because only a small fraction of protein interactions have been experimentally identified, which limits the prediction accuracy and generalization ability of predictors of protein interaction sites. It is therefore necessary to improve prediction performance by exploiting large amounts of unlabeled data together with the small amount of labeled data and background knowledge currently available.

Results

In this work, three semi-supervised support vector machine-based methods are proposed to improve the performance of protein interaction site prediction by involving the information of unlabeled protein sites. Five features related to the evolutionary conservation of amino acids are extracted from the HSSP database and the ConSurf server, i.e., residue spatial sequence spectrum, residue sequence information entropy, relative entropy, residue sequence conserved weight and residue evolution rate, to represent the residues within the protein sequence. Three predictors are then built for identifying interface residues on the protein surface using three types of semi-supervised support vector machine algorithms.

Conclusion

The experimental results demonstrate that the semi-supervised approaches can effectively improve the prediction of protein interaction sites when unlabeled information is incorporated into the predictors, and one of them achieves the best prediction performance, i.e., an accuracy of 70.7%, a sensitivity of 62.67% and a specificity of 78.72%. Compared with existing studies, the semi-supervised models improve the prediction performance.

Background

Protein-protein interactions (PPIs) are involved in various life activities, such as metabolism, signal transduction, gene transcription, protein translation, modification and localization, and are also closely related to disease [1,2,3,4,5,6,7,8,9]. However, PPIs vary from cell to cell and from time to time, which poses a challenge to their study.

Due to the rapid development of machine learning, many classical methods, such as Bayesian classifiers, support vector machines (SVM) and artificial neural networks, have been used to predict protein interaction sites [10,11,12,13,14,15,16,17,18,19]. Sprinzak et al. used correlated sequence-signatures as identifiers of interacting proteins, which can significantly reduce the search space and support a directed experimental interaction screen, and achieved high-quality experimental results [5]. Bock et al. proposed a phylogenetic bootstrapping algorithm that traverses a phenogram, interleaving rounds of computation and experiment, to develop a knowledge base of protein interactions in genetically similar organisms [6]. Enright et al. developed hydrophobic free-energy functions together with fusion detection based on sequence analysis and trained an SVM learning system to recognize and predict interactions solely from primary structure and associated physicochemical properties, which significantly improved the overall performance of the classifier [20]. Chen et al. proposed radial basis function neural networks optimized by the particle swarm optimization algorithm to predict protein interaction sites [2]. Wang et al. presented an SVM-based algorithm to identify protein-protein interaction sites at the residue level by incorporating the residue spatial sequence profile and evolution rate [21]. In another work, Wang et al. implemented a dataset reconstruction strategy using manifold learning under the hypothesis that interaction and non-interaction sites lie on different inherent structural manifolds [13, 22]. Although these methods have driven advances in PPI research, many interactions still cannot be tagged by experiments, so only a small number of labeled samples are available for model training in the prediction of PPI sites, which makes it difficult for the trained learning systems to achieve strong generalization ability [23].

Therefore, this paper proposes three semi-supervised machine learning-based computational models to address the problem of effectively utilizing the information in a large number of unlabeled samples to improve protein interaction site prediction when only a few labeled samples are available. Firstly, five evolutionarily conserved features of amino acids based on multiple sequence alignments are extracted, i.e., the residue spatial sequence spectrum, sequence information entropy, relative entropy, conserved weight and residue evolution rate. Then three semi-supervised learning methods are applied, i.e., the self-balancing semi-supervised support vector machine based on multiple kernel learning (Means3vm-mkl), the iterative label-mean self-training semi-supervised support vector machine (Means3vm-iter) and the safe semi-supervised support vector machine (S4VM), to build prediction models for the identification of protein interaction sites [24,25,26]. The experimental results demonstrate the superiority of the proposed methods over existing supervised and other approaches, e.g., a prediction accuracy of 0.707 for the S4VM model.

Results

In this work, three semi-supervised SVM algorithms, i.e., Means3vm-mkl, Means3vm-iter and S4VM, have been applied to the prediction of protein interaction sites from protein sequences. Compared to the traditional supervised SVM, semi-supervised models can effectively use the information from both labeled and unlabeled samples. The popular support vector classification software LIBSVM is adopted in this work with empirically optimal parameters, i.e., C1 = 1, C2 = 0.1, and a maximum of 50 generations. To validate the effectiveness of the proposed models, a 5-fold cross-validation technique and an original residue data set with 91 protein chains are used to evaluate the prediction performance. Herein, 2299 interface residues drawn from the definition of interaction sites are used for the construction of the three semi-supervised SVM models.
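The following is a minimal sketch of this evaluation protocol, assuming the residue feature matrix X (one 264-dimensional row per surface residue, as described in Methods) and the interface labels y are already available; scikit-learn's LIBSVM-backed SVC is used here as a stand-in supervised classifier, not the semi-supervised wrappers themselves.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

def cross_validate_svm(X, y, n_splits=5, random_state=0):
    """Sketch of the 5-fold cross-validation protocol on the residue data set."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    fold_acc = []
    for train_idx, test_idx in skf.split(X, y):
        clf = SVC(kernel="rbf", C=1.0)            # supervised baseline on the labeled folds
        clf.fit(X[train_idx], y[train_idx])
        fold_acc.append(accuracy_score(y[test_idx], clf.predict(X[test_idx])))
    return np.mean(fold_acc), np.std(fold_acc)
```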

It can be seen from Fig. 1 that the proposed three semi-supervised methods can distinguish interaction from non-interaction sites on protein sequences. The Means3vm-iter predictor achieves good prediction measures on the original protein residue data set D, i.e., an accuracy of 0.636, F-measure of 0.589, precision of 0.67, specificity of 0.745, sensitivity of 0.526 and MCC of 0.278. Compared to Means3vm-iter, the Means3vm-mkl based model shows better prediction performance, with the accuracy and F-measure increased by 3%. The reason is that multiple kernel learning can exploit the feature mapping capability of each basic kernel, and the data are better expressed in the combined feature space constructed from multiple feature spaces, which can significantly improve classification accuracy. However, multiple kernel learning is more complex, and the time required is longer than for the iterative optimization method. Means3vm-iter transforms the optimization problem into a quadratic program and is thus easily and quickly solved by standard solvers, although it may fall into a local minimum, and its classification accuracy is slightly lower than that of the Means3vm-mkl based model.

Fig. 1 Classification performance evaluation of the three semi-supervised methods on the dataset

It can also be found that the S4VM method achieves the best prediction performance among the three semi-supervised models on all prediction measures, i.e., an accuracy of 0.707, specificity of 0.787, sensitivity of 0.627, precision of 0.746, F-measure of 0.681 and MCC of 0.419. The overall prediction performance of S4VM is improved by more than 4% compared to the Means3vm-iter and Means3vm-mkl methods. S4VM attempts to consider all possible low-density boundaries to effectively prevent performance degradation, and therefore it can reduce both the false negative and false positive rates in prediction, which is confirmed by the relatively high values of 0.787 and 0.627 for specificity and sensitivity, respectively.

To assess the generalization ability of the prediction models, a five-fold cross-validation strategy is adopted in the predictors' construction. For each performance indicator, the difference between its value in each run and its mean value over the five cross-validation runs is calculated. It can be observed from Fig. 2 that the three semi-supervised predictors are robust, with most fluctuation ranges below 0.03, which indicates that the proposed models have excellent generalization ability when new samples are introduced. The largest differences are 0.015 for accuracy in the Means3vm-iter model, 0.043 for precision in the S4VM model, 0.021 for sensitivity in the Means3vm-iter model, 0.02 for F-measure in the Means3vm-iter model, 0.036 for MCC in the S4VM model and 0.048 for specificity in the Means3vm-mkl model. Among the three predictors, S4VM performs best, with mean differences of only 0.005 for accuracy, 0.032 for precision, 0.006 for sensitivity, 0.007 for specificity, 0.01 for F-measure and 0.021 for MCC.

Fig. 2 Prediction performance measures in 5 repetitions of cross-validation
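As a small illustration, the per-run fluctuation reported in Fig. 2 is simply the absolute deviation of each run's metric from its mean over the five repetitions; the accuracy values in the usage comment below are hypothetical.

```python
import numpy as np

def per_fold_deviation(metric_per_fold):
    """Given one value of a performance measure per cross-validation run, return the
    absolute deviation of each run from the mean, i.e. the fluctuation plotted in Fig. 2."""
    v = np.asarray(metric_per_fold, dtype=float)
    return np.abs(v - v.mean())

# Example with hypothetical accuracies over the five repetitions:
# per_fold_deviation([0.71, 0.70, 0.705, 0.712, 0.708])
```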

Discussion

It can be found that the three semi-supervised models proposed in this work can predict protein-protein interaction sites based on the extracted features. To further evaluate their effectiveness, the results of some previous works are used for comparison of prediction performance.

Prediction performance comparison between supervised and semi-supervised SVM

Most previous studies adopted supervised machine learning algorithms to predict interaction sites from protein sequences or structures, and some of them have to use data sampling techniques to balance the numbers of positives and negatives to avoid prediction bias [27,28,29,30]. In contrast to supervised methods, where all samples used in model training are labeled, semi-supervised machine learning approaches learn from both labeled and unlabeled samples, which makes more information available for learning; this is very significant for the study of protein interactions, many of which are still unknown.

To evaluate the value of unlabeled sample information for protein interface residues, the traditional supervised SVM algorithm is also directly used to predict protein interaction sites, and its results are shown in Fig. 3. On the same dataset, the predictive performance of the supervised SVM-based predictor is much lower than that of the semi-supervised ones: its accuracy is only 0.586, which is 0.12 lower than that of S4VM. On the other measures, the proposed semi-supervised models also outperform the supervised predictor, whose F-measure is only 0.56, MCC only 0.23, precision only 0.598, sensitivity only 0.529 and specificity only 0.643. These results suggest that unlabeled sample information, when used in conjunction with a small set of labeled data, can considerably improve learning accuracy, which is important given the current situation in which many protein interactions have not been identified by experiments.

Fig. 3 Comparison of experimental results between SVM and S4VM

Comparison with other approaches

In this work, five features related to amino acid conservation are extracted from the protein sequence for the identification of protein interface residues on the protein surface. To validate the effectiveness of the extracted features in discriminating between interface and non-interface residues, a comparison was made with a previous study by Li and Kuo based on the same data set. In contrast to the evolutionary conservation features used in this work, they predicted protein interaction sites using five sequence features. Figure 4 shows that both evolutionary and sequence features can identify protein interaction sites, but the evolutionarily conserved features show stronger classification capacity than the sequence features. Our proposed method produces more accurate predictions than the sequence features-based model, i.e., 0.124 higher in accuracy, 0.03 in sensitivity, 0.279 in specificity, 0.272 in F-measure and 0.326 in MCC. In particular, the precision is 0.435 higher than in Li and Kuo's work, which means that the false positive rate of prediction is reduced dramatically and that the features used in this work are truly sound for discriminating between protein interaction and non-interaction sites.

Fig. 4 Comparison of evaluation performance with the previous study

Visualization of experimental results

To further validate the predictions achieved by the proposed semi-supervised models, a test on one chain of the protein complex 1A4Y was taken as an example, and the molecular visualization tool PyMOL was used to display the predictions. Figure 5 shows the protein chain 1A4Y_A and the results obtained with the three semi-supervised models. In panels D, E and F, 218 balls represent the surface residues involved in the prediction; green, red, yellow and blue balls represent TP, TN, FP and FN, respectively. The detailed counts on the complex 1A4Y can be found in Table 1. Our approach improves overall predictive performance and reduces false positives, and the S4VM method performs particularly well: only 5.2% of the interface residues were missed by the S4VM method.

Fig. 5 Experimental visualization results. a and b show the protein chain 1A4Y_A and its spherical representation; c shows the 1A4Y protein chain after extraction of surface residues; d, e and f show the predicted results of Means3vm-mkl, Means3vm-iter and S4VM, where green, red, yellow and blue balls represent TP, TN, FP and FN, respectively

Table 1 The number of predictions in TP, TN, FP and FN
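As an illustration of how such a figure can be produced, the following is a minimal PyMOL scripting sketch (Python, run inside PyMOL); the residue-number lists tp, tn, fp and fn are placeholders that would come from the predictor's output, and the selections are illustrative only.

```python
from pymol import cmd

def color_predictions(chain="A", tp=(), tn=(), fp=(), fn=()):
    """Color the alpha-carbon spheres of one chain of 1A4Y by prediction outcome."""
    cmd.fetch("1a4y")                                  # download the complex from the PDB
    cmd.hide("everything")
    cmd.show("spheres", f"chain {chain} and name CA")  # one ball per residue
    for color, residues in {"green": tp, "red": tn, "yellow": fp, "blue": fn}.items():
        if residues:
            cmd.color(color, f"chain {chain} and resi {'+'.join(map(str, residues))}")
```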

Conclusion

This paper proposed a semi-supervised learning strategy for protein interaction site prediction. Firstly, a non-redundant dataset with 91 protein chains was selected, and five evolutionarily conserved features were extracted from common databases and servers for the vectorization of each amino acid residue. Then three semi-supervised learning methods, Means3vm-mkl, Means3vm-iter and S4VM, were applied to identify interaction sites on the surfaces of protein complexes. The experimental results show that the Means3vm-mkl and S4VM methods achieve excellent classification results when five-fold cross-validation is used, with accuracies of 0.662 and 0.707, respectively. The S4VM method achieves the best overall performance, with the highest MCC of 0.419 and F-measure of 0.681. In comparison with the supervised model and other studies, this work produced a clear improvement in protein interaction site prediction, which suggests the effectiveness of the proposed semi-supervised strategies in discriminating between protein interaction and non-interaction sites. Furthermore, the experimental results demonstrate that residues on the protein surface, even when their interface labels have not yet been determined, contain a great deal of information about protein interactions, which is important for understanding cellular activity and for drug design.

Methods

Dataset

The dataset used in this work comes from a previous study by Ansari and Helms, in which 170 pairs of transient protein interactions were collected [31]. Protein chains of fewer than 50 residues and some outdated small-family protein chains were discarded to make the data more representative. If there are several interacting partners for the same chain, the partner chain with the most interfacial residues is retained. The BLASTCLUST program was used to exclude protein chains with sequence similarity greater than 30%, and finally 91 non-redundant protein chains remained for this study, which are listed in Table 2 [15, 32, 33].

Table 2 The protein chains used in this work

The definitions of the residues are the same as those used by Fariselli et al. [30]. A residue is defined as a surface residue if its relative accessible surface area is greater than 16% of the maximum accessible surface area for that type of amino acid. Among the surface residues, a residue is defined as an interface residue if the distance between its alpha carbon atom and that of any residue in the interacting chain is less than 1.2 nm; otherwise it is categorized as a non-interface residue. Based on the above definitions, the original residue data set D is composed of 2299 interface residues and 8131 non-interface residues obtained from the 91 protein chains used in this work.
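The labeling rule above can be sketched as follows, assuming the relative accessible surface areas and alpha-carbon coordinates have already been computed with external tools (e.g. DSSP or NACCESS); the function name and inputs are illustrative only.

```python
import numpy as np

def label_residues(rel_asa, ca_coords, partner_ca_coords, surface_cut=0.16, contact_cut=12.0):
    """rel_asa: dict residue_id -> relative accessible surface area (fraction of the
    residue-type maximum); ca_coords: dict residue_id -> alpha-carbon coordinates (angstrom);
    partner_ca_coords: array-like of alpha-carbon coordinates of the interacting chain.
    1.2 nm = 12 angstroms is used as the interface distance cutoff."""
    labels = {}
    partner = np.asarray(partner_ca_coords, dtype=float)
    for res_id, rasa in rel_asa.items():
        if rasa <= surface_cut:
            labels[res_id] = "buried"            # not a surface residue, excluded from prediction
            continue
        dists = np.linalg.norm(partner - np.asarray(ca_coords[res_id], dtype=float), axis=1)
        labels[res_id] = "interface" if dists.min() < contact_cut else "non-interface"
    return labels
```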

Feature extraction

Many properties of amino acids have been used for the prediction of protein interactions or interaction sites; among them, evolutionary conservation analyses have been widely applied to characterize functionally or structurally important residues because such amino acids in a protein sequence are conserved under selective evolutionary pressure [20, 34,35,36]. In this work, five features related to evolutionary conservation are extracted for protein interaction site prediction: the residue spatial sequence profile, sequence information entropy, relative entropy and residue sequence conserved weight are extracted from the HSSP database, and the residue evolutionary rate is obtained from the ConSurf server.

The spatial sequence profile of amino acid residues, a feature widely used in protein-related studies, represents the frequency of each amino acid type at a given residue position across the multiple sequence alignment. The protein residue sequence entropy uses Shannon's information theory to estimate a conservation score of sequence variability, and the relative entropy is the normalized sequence information entropy. The conserved weight of a residue position quantifies the positional conservation of the protein sequence. The residue evolution rate is estimated from a statistical point of view, taking into account the dependencies generated by the stochastic process of sequence evolution; its maximum likelihood estimate can be obtained with the Rate4Site algorithm, which calculates the conservation of each amino acid position [37]. For each residue, 20 dimensions are extracted for the sequence profile and one dimension for each of the other four features.
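As a minimal illustration of two of these features, the sequence information entropy and its normalized (relative) form can be computed from a 20-dimensional amino-acid frequency profile as follows; the profile itself is assumed to come from the HSSP alignment for the protein.

```python
import numpy as np

def sequence_entropy(profile):
    """Shannon entropy (in bits) of a 20-dimensional amino-acid frequency profile at one position."""
    p = np.asarray(profile, dtype=float)
    p = p / p.sum()                      # normalize frequencies to probabilities
    nz = p[p > 0]                        # avoid log(0)
    return -np.sum(nz * np.log2(nz))

def relative_entropy(profile, n_states=20):
    """Sequence entropy normalized by its maximum, so the value lies in [0, 1]."""
    return sequence_entropy(profile) / np.log2(n_states)
```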

As in many previous studies, a sliding-window strategy is adopted in this work to capture the interface context: the window is formed by the target residue together with its 10 spatially closest residues. Each target residue is therefore represented by a 264-dimensional vector (11 residues × 24 features) and used for subsequent predictor construction.
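A sketch of this encoding, assuming the 24 per-residue features (20-dimensional profile plus the four scalar conservation features) and the indices of the 10 spatially nearest residues have already been computed:

```python
import numpy as np

def window_vector(per_residue_features, target_idx, neighbor_idx):
    """per_residue_features: (n_residues, 24) array of per-residue features;
    neighbor_idx: indices of the 10 spatially closest residues of the target,
    precomputed from the structure.  Returns the 11 x 24 = 264-dimensional vector."""
    rows = [target_idx] + list(neighbor_idx)        # target first, then its neighbors
    vec = np.asarray(per_residue_features, dtype=float)[rows].ravel()
    assert vec.shape == (264,), "expected 11 residues x 24 features"
    return vec
```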

Semi-supervised models

In semi-supervised learning, the labeled sample set is {x1, …, xl} and the unlabeled sample set is {xl + 1, …, xl + u}, where l and u are the numbers of labeled and unlabeled samples, respectively, and yi ∈ {+1, −1}. The index sets of the labeled and unlabeled samples can be denoted by Il = {1, …, l} and Iu = {l + 1, l + 2, …, l + u}. As one of the most popular semi-supervised learning methods, the semi-supervised support vector machine (S3VM) attempts to regularize and adjust the decision boundary by exploring unlabeled data based on the clustering assumption, as illustrated in Fig. 6 [26, 38]. MeanS3VM, a fast S3VM algorithm, estimates the class means of the unlabeled data, so that its classification performance is close to that of a supervised SVM. The core idea of the meanS3VM algorithm is to maximize the margin between the class means of the two categories of unlabeled samples, and thus the goal is to find the decision function f(x) = w′∅(x) + b that minimizes

$$ \min_{d\in \Delta}\ \min_{w,b,\rho,\xi}\ \frac{1}{2}\left\Vert w\right\Vert^2+c_1\sum_{i=1}^{l}\xi_i-c_2\rho $$
(1)
$$ s.t.\ y_i\left(w^{\prime}\varnothing\left(x_i\right)+b\right)\ge 1-\xi_i,\ \xi_i\ge 0,\ i=1,\dots,l, $$
$$ \frac{1}{u_{+}}\left(w^{\prime}\sum_{j=l+1}^{l+u}d_{j-l}\varnothing\left(x_j\right)\right)+b\ge \rho $$
$$ \frac{1}{u_{-}}\left(w^{\prime}\sum_{j=l+1}^{l+u}\left(1-d_{j-l}\right)\varnothing\left(x_j\right)\right)+b\le -\rho $$
$$ \sum_{i\in I_u}\operatorname{sgn}\left(w^{\prime}\varnothing\left(x_i\right)+b\right)=r $$
Fig. 6 Illustration of the different classification boundaries of an SVM, which considers only labeled data, and an S3VM, which considers both labeled and unlabeled data; green balls denote positives, red ones negatives, and blue ones unlabeled samples

Herein, the last constraint is a balance constraint that prevents all unlabeled samples from being assigned to the same class, r is a user-defined parameter, \( \Delta =\left\{d\mid d_i\in \left\{0,1\right\},\ \sum_{i=1}^{u}d_i=u_{+}\right\} \), \( u_{+}=\frac{r+u}{2} \), and \( u_{-}=\frac{-r+u}{2} \). Because of the bilinear terms between w and d, this formulation is non-convex, and two algorithms can be used to solve it: the first is based on multiple kernel learning (MeanS3VM_mkl), while the second is based on alternating optimization (MeanS3VM_iter) [24].

1) MeanS3VM_mkl

Mathematically, the above problem can be solved through its dual form:

$$ \min_{d\in \Delta}\ \max_{\alpha\in A}\ \alpha^{\prime}\tilde{I}-\frac{1}{2}\left(\alpha \bullet \tilde{y}\right)^{\prime}K^{d}\left(\alpha \bullet \tilde{y}\right) $$
(2)

which can be expressed in the form of a multiple kernel learning optimization problem:

$$ \min_{\mu\in M}\ \max_{\alpha\in A}\ \alpha^{\prime}\tilde{I}-\frac{1}{2}\left(\alpha \bullet \tilde{y}\right)^{\prime}\left(\sum_{t:\ d_t\in \Delta}\mu_t K^{d_t}\right)\left(\alpha \bullet \tilde{y}\right) $$
(3)

where \( M=\left\{\mu \mid \sum_t \mu_t=1,\ \mu_t\ge 0\right\} \),

$$ \alpha =\left[\alpha_1,\dots,\alpha_{l+2}\right]^{\prime}\in R^{l+2} $$
$$ \tilde{I}=\left[1,\dots,1,0,0\right]^{\prime}\in R^{l+2} $$
$$ \tilde{y}=\left[y_1,\dots,y_l,1,-1\right]^{\prime}\in R^{l+2} $$
$$ A=\left\{\alpha \mid \sum_{i=1}^{l+2}\alpha_i\tilde{y}_i=0,\ \sum_{i=1}^{l+2}\alpha_i=c_2;\ 0\le \alpha_i\le c_1,\ \forall i=1,\dots,l;\ 0\le \alpha_{l+1},\alpha_{l+2}\le c_2\right\} $$

The kernel matrix is \( K^{d}\in R^{(l+2)\times (l+2)} \), with elements \( K_{ij}^{d}=\left(\varnothing_i^{d}\right)^{\prime}\varnothing_j^{d} \), where

$$ \varnothing_i^{d}=\left\{\begin{array}{ll}\varnothing\left(x_i\right), & i=1,\dots,l\\ \frac{1}{u_{+}}\sum_{j=l+1}^{l+u}d_{j-l}\varnothing\left(x_j\right), & i=l+1\\ \frac{1}{u_{-}}\sum_{j=l+1}^{l+u}\left(1-d_{j-l}\right)\varnothing\left(x_j\right), & i=l+2\end{array}\right. $$
(4)

Since the number of feasible label vectors d_t is very large, the cutting plane algorithm is adopted to solve the above problem and to find the optimal label vector d for the unlabeled samples; the details can be found in references [17, 24].

2) Means3vm-iter

Another way to solve the MeanS3VM problem is alternating optimization, whose subproblem over d can be written as:

$$ \max_{d\in \Delta,\ \rho}\ \rho $$
(5)
$$ s.t.\ \frac{1}{u_{+}}\left(w^{\prime}\sum_{j=l+1}^{l+u}d_{j-l}\varnothing\left(x_j\right)\right)+b\ge \rho $$
$$ \frac{1}{u_{-}}\left(w^{\prime}\sum_{j=l+1}^{l+u}\left(1-d_{j-l}\right)\varnothing\left(x_j\right)\right)+b\le -\rho $$

In the optimization process, it has been proved that if f(xi) > f(xj) for i, j ∈ Iu, then d_{i−l} ≥ d_{j−l} [11]. Labels are assigned to the unlabeled samples according to their predicted values based on this property, which ensures that the label vector d obtained in each iteration is better than the previous one [17, 24].
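A minimal sketch of this assignment step, assuming the current decision values f(x) on the unlabeled samples and the balance parameter r are given (the function name is illustrative, and r + u is assumed to be even so that u_plus is an integer):

```python
import numpy as np

def assign_unlabeled_labels(decision_values, r):
    """Give +1 to the u_plus top-ranked unlabeled samples and -1 to the rest, so that the
    balance constraint sum(sgn(f(x))) = r is satisfied automatically."""
    u = len(decision_values)
    u_plus = (r + u) // 2                               # number of positives among unlabeled data
    d = -np.ones(u, dtype=int)
    d[np.argsort(decision_values)[::-1][:u_plus]] = 1   # highest decision values become +1
    return d
```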

S4VM

Given a large number of unlabeled samples in the data set, there may be multiple large-margin low-density boundaries, and it is difficult to determine which one is the best. When the labeled samples are few and these low-density boundaries differ greatly from one another, a wrong selection incurs a large loss and leads to performance degradation, sometimes even worse than using only the labeled samples, which limits the use of semi-supervised learning methods in certain critical areas.

S4VM improves on the traditional S3VM. The difference between S4VM and S3VM is that S3VM tries to find the single best low-density boundary, while S4VM considers multiple possible low-density boundaries simultaneously. Its main idea is to optimize the label assignment of the unlabeled samples with respect to many different large-margin boundaries, so that the performance improvement over an SVM trained only on the labeled samples is maximized even in the worst case. The specific practice is as follows:

Let \( h\left(f,\hat{y}\right) \) denote the objective function optimized by S3VM:

$$ h\left(f,\hat{y}\right)=\frac{\left\Vert f\right\Vert_{H}}{2}+C_1\sum_{i=1}^{l}\ell\left(y_i,f\left(x_i\right)\right)+C_2\sum_{j=1}^{u}\ell\left(\hat{y}_j,f\left(\hat{x}_j\right)\right) $$
(6)

The goal is to find multiple large-margin low-density boundaries \( \left\{f_t\right\}_{t=1}^{T} \) and the corresponding label assignments \( \left\{\hat{y}_t\right\}_{t=1}^{T} \) such that the following function is minimized:

$$ \min_{\left\{f_t,\hat{y}_t\in \beta\right\}_{t=1}^{T}}\ \sum_{t=1}^{T}h\left(f_t,\hat{y}_t\right)+M\Omega\left(\left\{\hat{y}_t\right\}_{t=1}^{T}\right) $$
(7)

where T is the number of separating boundaries and Ω is a penalty function that measures the diversity of the boundaries; various functions can be used in the implementation. M is a large constant used to enforce this diversity. Obviously, minimizing (7) encourages both diverse boundaries and large margins. Without loss of generality, we assume that f is a linear function, f(x) = w′∅(x) + b. The optimization problem that needs to be solved is then expressed as

$$ \min_{\left\{w_t,b_t,\hat{y}_t\in \beta\right\}_{t=1}^{T}}\ \sum_{t=1}^{T}\left(\frac{1}{2}\left\Vert w_t\right\Vert^2+c_1\sum_{i=1}^{l}\xi_i+c_2\sum_{j=1}^{u}\hat{\xi}_j\right)+M\Omega\left(\left\{\hat{y}_t\right\}_{t=1}^{T}\right) $$
(8)
$$ s.t.\ y_i\left(w_t^{\prime}\varnothing\left(x_i\right)+b_t\right)\ge 1-\xi_i,\ \xi_i\ge 0 $$
$$ \hat{y}_{t,j}\left(w_t^{\prime}\varnothing\left(\hat{x}_j\right)+b_t\right)\ge 1-\hat{\xi}_j,\ \hat{\xi}_j\ge 0 $$
$$ \forall i=1,\dots,l,\ \forall j=1,\dots,u,\ \forall t=1,\dots,T $$

An efficient sampling-based search strategy is then used to solve (8): first, multiple large-margin low-density separators are found through local search, and the k-means clustering algorithm is then used to identify representative separators with large diversity [16].
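The following is a simplified, illustrative sketch of the sampling-and-diversify stage only: candidate labelings of the unlabeled data are proposed by perturbing the decision values of an inductive SVM and then reduced to a diverse subset with k-means. The actual S4VM local search and its worst-case selection step are not reproduced here, and all function and parameter names are assumptions for illustration.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.cluster import KMeans

def candidate_labelings(X_l, y_l, X_u, n_candidates=20, n_keep=5, random_state=0):
    """Propose several candidate labelings of the unlabeled data and keep a diverse subset."""
    rng = np.random.RandomState(random_state)
    base = SVC(kernel="rbf").fit(X_l, y_l)              # inductive SVM on labeled data only
    scores = base.decision_function(X_u)                # signed distances for unlabeled samples
    candidates = []
    for _ in range(n_candidates):
        # perturb the decision values to propose a candidate labeling of the unlabeled data
        noisy = scores + rng.normal(scale=scores.std() + 1e-8, size=scores.shape)
        candidates.append(np.where(noisy >= 0, 1, -1))
    # keep a small, mutually diverse subset of labelings via k-means on the label vectors
    km = KMeans(n_clusters=n_keep, n_init=10, random_state=random_state).fit(np.array(candidates))
    reps = np.sign(km.cluster_centers_)
    reps[reps == 0] = 1                                 # break ties toward the positive class
    return reps.astype(int)                             # n_keep representative label vectors
```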

Evaluation criteria

In addition to the accuracy, precision, sensitivity and specificity that are often used to evaluate prediction performance, the F-measure and Matthews correlation coefficient (MCC) are introduced. The F-measure is the weighted harmonic mean of recall and precision, often used to evaluate classification models, and the MCC is an effective measure for imbalanced data classification.

$$ Accuracy=\frac{TP+TN}{TP+FP+TN+FN} $$
(11)
$$ Sensitivity=\frac{TP}{TP+FN} $$
(12)
$$ Precision=\frac{TP}{TP+FP} $$
(13)
$$ Specificity=\frac{TN}{FP+TN} $$
(14)
$$ F\text{-}measure=2\times \frac{Precision\times Sensitivity}{Precision+Sensitivity} $$
(15)
$$ MCC=\frac{TP\times TN-FP\times FN}{\sqrt{\left(TP+FP\right)\times \left(TP+FN\right)\times \left(TN+FP\right)\times \left(TN+FN\right)}} $$
(16)

where TP, FP, TN and FN represent the number of true positives (correctly predicted interface residues), the number of false positives (incorrectly predicted interface residues), the number of true negatives (correctly predicted non-interface residues) and the number of false negatives (incorrectly predicted non-interface residues), respectively.
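For reference, Eqs. (11)–(16) translate directly into the following helper; the counts are assumed to come from comparing predicted and true interface labels.

```python
import math

def evaluation_metrics(tp, fp, tn, fn):
    """Compute the six evaluation measures from the confusion-matrix counts."""
    accuracy    = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)
    precision   = tp / (tp + fp)
    specificity = tn / (fp + tn)
    f_measure   = 2 * precision * sensitivity / (precision + sensitivity)
    mcc = (tp * tn - fp * fn) / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return dict(accuracy=accuracy, sensitivity=sensitivity, precision=precision,
                specificity=specificity, f_measure=f_measure, mcc=mcc)
```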

Abbreviations

HSSP:

Homology-derived Secondary Structure of Proteins

MCC:

Matthews Correlation Coefficient

PPI:

Protein-protein interaction

S3VM:

Semi-supervised support vector machine

S4VM:

Safe semi-supervised support vector machine

References

  1. Rual JF, Venkatesan K, Hao T, Hirozane-Kishikawa T, Dricot A, Li N, Berriz GF, Gibbons FD, Dreze M, Ayivi-Guedehoussou N, et al. Towards a proteome-scale map of the human protein-protein interaction network. Nature. 2005;437(7062):1173–8.


  2. Chen Y, Xu J, Yang B, Zhao Y, He W. A novel method for prediction of protein interaction sites based on integrated RBF neural networks. Comput Biol Med. 2012;42(4):402–7.


  3. Liu Q, Chen P, Wang B, Zhang J, Li J. dbMPIKT: a database of kinetic and thermodynamic mutant protein interactions. BMC Bioinformatics. 2018;19(1):455.


  4. Ji Z, Wang B, Yan K, Dong L, Meng G, Shi L. A linear programming computational framework integrates phosphor-proteomics and prior knowledge to predict drug efficacy. BMC Syst Biol. 2017;11(Suppl 7):127.


  5. Zhu M, Song X, Chen P, Wang W, Wang B. dbHDPLS: a database of human disease-related protein-ligand structures. Comput Biol Chem. 2019;78:353–8.


  6. Yang C, Ge SG, Zheng CH. ndmaSNF: cancer subtype discovery based on integrative framework assisted by network diffusion model. Oncotarget. 2017;8(51):89021–32.


  7. Ge SG, Xia J, Sha W, Zheng CH. Cancer subtype discovery based on integrative model of multigenomic data. IEEE/ACM Trans Comput Biol Bioinform. 2017;14(5):1115–21.


  8. Chen P, Han K, Li X, Huang DS. Predicting key long-range interaction sites by B-factors. Protein Pept lett. 2008;15(5):478–83.


  9. Shen Z, Bao W, Huang DS. Recurrent neural network for predicting transcription factor binding sites. Sci Rep. 2018;8(1):15270.


  10. Pan XY, Zhang YN, Shen HB. Large-scale prediction of human protein-protein interactions from amino acid sequence based on latent topic features. J Proteome Res. 2010;9(10):4992–5001.


  11. Xia JF, Wang SL, Lei YK. Computational methods for the prediction of protein-protein interactions. Protein Pept Lett. 2010;17(9):1069.


  12. Zhang YN, Pan XY, Huang Y, Shen HB. Adaptive compressive learning for prediction of protein-protein interactions from primary sequence. J Theor Biol. 2011;283(1):44–52.


  13. Wang B, Huang DS, Jiang C. A new strategy for protein interface identification using manifold learning method. IEEE Trans Nanobioscience. 2014;13(2):118–23.


  14. Jiang J, Wang N, Chen P, Zheng C, Wang B. Prediction of Protein Hotspots from Whole Protein Sequences by a Random Projection Ensemble System. Int J Mol Sci. 2017;18(7):1453.

  15. Wang B, Chen P, Wang P, Zhao G, Zhang X. Radial basis function neural network ensemble for predicting protein-protein interaction sites in heterocomplexes. Protein Pept Lett. 2010;17(9):1111–6.


  16. Ji ZW, Wang B, Yan K, Dong LG, Meng GM, Shi L. A linear programming computational framework integrates phosphor-proteomics and prior knowledge to predict drug efficacy. BMC Syst Biol. 2017;11(S 7):127.

  17. Hu SS, Chen P, Wang B, Li J. Protein binding hot spots prediction from sequence only by a new ensemble learning method. Amino Acids. 2017;49(10):1773–85.


  18. Zhu L, Deng SP, You ZH, Huang DS. Identifying spurious interactions in the protein-protein interaction networks using local similarity preserving embedding. Ieee Acm T Comput Bi. 2017;14(2):345–52.


  19. Zhu L, You ZH, Huang DS, Wang B. LSE: A Novel Robust Geometric Approach for Modeling Protein-Protein Interaction Networks. PLoS One. 2013;8(4):e58368.

  20. Liu Q, Chen P, Wang B, Zhang J, Li J. Hot spot prediction in protein-protein interactions by an ensemble system. BMC Syst Biol. 2018;12(Suppl 9):132.


  21. Wang B, Chen P, Huang D-S, Li J-J, Lok T-M, Lyu MR. Predicting protein interaction sites from residue spatial sequence profile and evolution rate. FEBS Lett. 2006;580(2):380–4.


  22. Wang B, Huang DS. Dataset reconstruction for protein interface identification using manifold learning method. In: IEEE International Conference on Bioinformatics and Biomedicine; 2014. p. 398–403.


  23. Zhu L, Deng SP, You ZH, Huang DS. Identifying spurious interactions in the protein-protein interaction networks using local similarity preserving embedding. IEEE/ACM Trans Comput Biol Bioinform. 2017;14(2):345–52.


  24. Li Y-F, Kwok JT, Zhou Z-H. Semi-supervised learning using label mean. In: International Conference on Machine Learning; 2009. p. 633–40.


  25. Li Y-F, Zhou Z-H. S4VM: Safe Semi-Supervised Support Vector Machine. In: Computing Research Repository; 2010. abs/1005.1001.


  26. Bennett K, Demiriz A. Semi-supervised support vector machines. Adv Neural Inf Proces Syst. 1999;11:368–74.


  27. Iqbal M, Freitas AA, Johnson CG. A Hybrid Rule-Induction/Likelihood-Ratio Based Approach for Predicting Protein-Protein Interactions; 2009.


  28. Liu L, Cai Y, Lu W, Feng K, Peng C, Niu B. Prediction of protein–protein interactions based on PseAA composition and hybrid feature selection. Biochem Biophys Res Commun. 2009;380(2):318–22.


  29. Oh M, Joo KJ. Protein-binding site prediction based on three-dimensional protein modeling. Proteins Structure Function & Bioinformatics. 2009;77(S9):152.


  30. Fariselli P, Pazos F, Valencia A, Casadio R. Prediction of protein–protein interaction sites in heterocomplexes with neural networks. FEBS J. 2010;269(5):1356–61.


  31. Ansari S, Helms V. Statistical analysis of predominantly transient protein–protein interfaces. Proteins Struct Funct Bioinform. 2010;61(2):344–55.


  32. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990;215(3):403–10.


  33. Chen P, Hu SS, Zhang J, Gao X, Li JY, Xia JF, Wang B. A sequence-based dynamic ensemble learning system for protein ligand-binding site prediction. Ieee Acm T Comput Bi. 2016;13(5):901–12.


  34. Choi YS, Han SK, Kim J, Yang JS, Jeon J, Ryu SH, Kim S. ConPlex: a server for the evolutionary conservation analysis of protein complex structures. Nucleic Acids Res. 2010;38(Web Server issue):W450–6.


  35. Wei PJ, Zhang D, Li HT, Xia J, Zheng CH. DriverFinder: a gene length-based network method to identify Cancer driver genes. Complexity. 2017;2017(99):1–10.


  36. Wei PJ, Zhang D, Xia J, Zheng CH. LNDriver: identifying driver genes by integrating mutation and expression data based on gene-gene interaction network. Bmc Bioinformatics. 2016;17(Suppl 17):467.


  37. Glaser F, Pupko T, Paz I, Bell RE, Bechor-Shental D, Martz E, Ben-Tal N. ConSurf: identification of functional regions in proteins by surface-mapping of phylogenetic information. Bioinformatics. 2003;19(1):163–4.


  38. Zhang X, Tian Y, Cheng R, Jin Y. A Decision Variable Clustering Based Evolutionary Algorithm for Large-scale Many-objective Optimization. IEEE Trans Evol Comput. 2018;22(1):97–112.



Acknowledgments

None.

About this supplement

This article has been published as part of BMC Bioinformatics Volume 20 Supplement 25, 2019: Proceedings of the 2018 International Conference on Intelligent Computing (ICIC 2018) and Intelligent Computing and Biomedical Informatics (ICBI) 2018 conference: bioinformatics. The full contents of the supplement are available online at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume-20-supplement-25.

Funding

This work was supported by the National Natural Science Foundation of China (Nos. 61472282, 61672035 and 61872004), Key Program for Educational Commission of Anhui Province of China (No. KJ2019ZD05, KJ2017A041), Co-Innovation Center for Information Supply & Assurance Technology in AHU (ADXXBZ201705), Anhui Province Funds for Excellent Youth Scholars in Colleges (gxyqZD2016068), and Anhui Scientific Research Foundation for Returned Scholars.

Author information

Authors and Affiliations

Authors

Contributions

YW (Ye Wang) and CM conceived of the study; XZ, YW (Yan Wang), YZ, JZ and YX participated in the experiment design; YW (Ye Wang), PC and BW carried it out and drafted the manuscript. All authors read and approved the final manuscript.

Corresponding authors

Correspondence to Peng Chen or Bing Wang.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.


About this article


Cite this article

Wang, Y., Mei, C., Zhou, Y. et al. Semi-supervised prediction of protein interaction sites from unlabeled sample information. BMC Bioinformatics 20 (Suppl 25), 699 (2019). https://doi.org/10.1186/s12859-019-3274-7

