Predicting protein-protein interactions via multivariate mutual information of protein sequences

Background Protein-protein interactions (PPIs) are central to many biological processes. Many algorithms and methods have been developed to predict PPIs and protein interaction networks. However, the application of most existing methods is limited because they are computationally expensive and rely on a large number of homologous proteins and interaction marks of protein partners. In this paper, we propose a novel sequence-based approach that uses multivariate mutual information (MMI) as a protein feature representation and predicts PPIs via Random Forest (RF).
Methods Our method constructs a 638-dimensional vector to represent each pair of proteins. First, we cluster the twenty standard amino acids into seven functional groups and transform protein sequences into encoded sequences. Then, we use a novel multivariate mutual information feature representation scheme, combined with normalized Moreau-Broto Autocorrelation, to extract features from protein sequence information. Finally, we feed the feature vectors into a Random Forest model to distinguish interacting pairs from non-interacting pairs.
Results To evaluate the performance of our new method, we conduct several comprehensive tests for predicting PPIs. Experiments show that our method achieves better results than other outstanding methods for sequence-based PPIs prediction. On the S.cerevisiae PPIs dataset, our method achieves 95.01 % accuracy and 92.67 % sensitivity. On the H.pylori PPIs dataset, our method achieves 87.59 % accuracy and 86.81 % sensitivity. In addition, we test our method on three other important PPIs networks: the one-core network, the multiple-core network, and the crossover network.
Conclusions Compared to the Conjoint Triad method, the accuracies of our method on these three networks are increased by 6.25, 2.06 and 18.75 %, respectively. Our proposed method is a useful tool for future proteomics studies.


Background
Identification of protein-protein interactions (PPIs) is important to elucidate protein functions and identify biological processes in a cell. The knowledge of PPIs can help people better understand disease mechanisms and drug design. In the past several years, a large number of technologies have been developed for the large-scale analysis of PPIs. In general, there are three categories of methods for detecting PPIs: methods based on the information of evolution, methods based on natural language processing, and methods based on features of amino acid sequence.
*Correspondence: fguo@tju.edu.cn. School of Computer Science and Technology, Tianjin University, No.135, Yaguan Road, Tianjin Haihe Education Park, Tianjin, People's Republic of China
A large number of past studies have made clear that the protein-protein interaction has a co-evolution trend [1]. The evolution information is extracted from multiple sequence alignment of homologous proteins. Tree similarity is used as a simple linear correlation between distance matrices of two protein families, as a proxy of their phylogenetic trees [2]. MirrorTree [3][4][5] evaluates the relationship between tree similarities and physical or functional interactions. It is possible to predict PPIs on a genomic scale with higher correlations indicating a higher probability of protein-protein interaction. Carlo et al. [6] presented a log-likelihood score for protein-protein interaction. Direct Coupling Analysis (DCA) has been used to predict response regulator (RR) interaction partners for orphan histidine sensor kinase (SK) proteins in bacterial two-component signal transduction systems [7]. They also presented a protein-protein interaction score, which is based on improved efficiency of multivariate gaussian approach [8]. However, since these methods need a large number of homologous proteins and interaction marks of protein partners, they are very difficult to compute and their applications are limited.
Many methods have been developed to find the evidence for PPIs from PubMed abstracts based on Natural Language Processing (NLP) [9]. According to a certain semantic model, these methods automatically extract relevant pieces of information from texts, since a large number of known PPIs are stored in the scientific literature of biology and medicine. Daraselia et al. [10] used a method, called MedScan, to extract more than one million pieces of data from PubMed. They obtained accuracy rates of up to 91 %, compared with the BIND and DIP databases [11]. The problem of this approach is that some PPIs information may be missing from literature, thus the prediction may not be complete.
It might be possible to predict PPIs accurately by using only protein sequence information with methods based on machine learning algorithms and features of amino acids. To use machine learning methods in this task, one of the most important computational challenges is to extract useful features from protein sequences. Generally, there are several kinds of feature representation methods including Auto Covariance (AC) [12], Auto Cross Covariance (ACC) [12], Conjoint Triad (CT) [13], Local Protein Sequence Descriptors (LD) [14,15], Multi-scale Continuous and Discontinuous feature set (MCD) [16], Physicochemical Property Response Matrix combined with Local Phase Quantization descriptor (PR-LPQ) [17], Multi-scale Local Feature Descriptors (MLD) [18], as well as Substitution Matrix Representation (SMR) [19]. AC and ACC [12] use seven physicochemical properties of amino acids to reflect their interaction modes whenever possible. After being represented by these seven descriptors, a pair of proteins can be converted into a 420-dimensional vector by AC, and into a 2940-dimensional vector by ACC. CT [13] considers the properties of each amino acid and its vicinal neighbors and regards three contiguous amino acids as a unit. The 20 amino acids are clustered into seven groups according to the dipoles and volumes of their side chains. The PPIs information of protein sequences can then be projected into a homogeneous vector space by counting the frequency of each type of unit. The descriptors of the two proteins are concatenated into a 686-dimensional vector by CT.
Similar to CT, LD [14,15] clusters the twenty standard amino acids into seven functional groups. It splits the protein sequence into ten local regions of varying length to describe multiple overlapping continuous and discontinuous interaction patterns within a protein sequence. For each local region, three local descriptors are calculated: composition (C), transition (T) and distribution (D). A 1260-dimensional vector is constructed to represent each protein pair by LD. MLD [18] uses a multi-scale decomposition technique to divide the protein sequence into multiple sequence segments of varying length to describe overlapping local regions. A binary coding scheme is then adopted to construct a set of continuous regions on the basis of the above partition. A 1134-dimensional vector is constructed to represent each protein pair by MLD. MCD [16] is similar to MLD, except that it constructs a 1764-dimensional vector for each protein pair. Indeed, LD, MCD and MLD can be categorized as the same type of method.
PR-LPQ [17] adopts the physicochemical property response matrix method to transform the amino acid sequence into a matrix and then employs the local phase quantization-based texture descriptor to extract local phase information in the matrix. SMR is based on BLOSUM62, which is considered to be powerful for detecting weak protein similarities. Huang et al. [19] used BLOSUM62 to construct a new matrix representation from a protein sequence. Then, the matrix is lossily compressed by the Discrete Cosine Transform (DCT) and a 400-dimensional feature vector is extracted from the compressed matrix. Each pair of protein sequences forms an 800-dimensional feature vector, which is fed into the Weighted Sparse Representation based Classifier (WSRC) for predicting PPIs.
In this paper, we propose a novel sequence-based approach with a k-gram feature representation calculated as Multivariate Mutual Information (MMI). Combined with normalized Moreau-Broto Autocorrelation (NMBAC), we predict PPIs via Random Forest (RF), which is an ensemble learning method for classification, regression and other tasks. For the performance evaluation, our method is applied to the S.cerevisiae PPIs dataset. Our method achieves 95.01 % accuracy and 92.67 % sensitivity. Compared with the existing best method, the accuracy is increased by 0.29 %. To further demonstrate the effectiveness of our method, we also test it on the H.pylori PPIs dataset. Our method achieves 87.59 % accuracy and 86.81 % sensitivity. On the human 8161 PPIs dataset, our method achieves 97.56 % accuracy and 96.57 % sensitivity. In addition, we use the S.cerevisiae PPIs dataset to construct a model to predict the PPIs datasets of five other independent species. Compared with the state-of-the-art methods, the accuracy is increased by 2.42 % on average. We also test our method on two special PPIs datasets [20]. On the yeast dataset, our method achieves 82, 82, 62 and 61 % AUROC on four different test classes (typical Cross-Validated (CV) and distinct test classes C1, C2 and C3). On the human dataset, our method achieves 82, 82, 60 and 57 % AUROC on the four test classes. Finally, we test our method on three important PPIs networks: the one-core network (CD9) [21], the multiple-core network (Ras-Raf-Mek-Erk-Elk-Srf pathway) [22], and the crossover network (Wnt-related network) [23]. Compared to the Conjoint Triad (CT) method [13], the accuracies of our method are increased by 6.25, 2.06 and 18.75 %, respectively.

Methods
Our method predicts protein-protein interactions from protein sequence information in two steps. First, we extract features from the protein sequences, so that each feature vector characterizes one pair of proteins. We use a k-gram feature representation calculated as Multivariate Mutual Information (MMI) and extract additional features by normalized Moreau-Broto Autocorrelation (NMBAC). These two approaches are employed to transform the protein sequences into feature vectors. Then, we feed the feature vectors into a classifier to distinguish interacting pairs from non-interacting pairs.

Multivariate mutual information
Inspired by previous work [13,24,25] on extracting features from protein sequences, we propose a novel method to fully describe the key information of protein-protein interaction. Many techniques use the k-gram feature representation, which is commonly used for protein sequence classification [26,27]. Here k represents the number of conjoint amino acids. For example, CT [13] uses the 3-gram feature representation. Shen et al. [13] indicated that methods that do not consider the local environment are usually not reliable and robust, so they proposed a conjoint triad method that considers the properties of amino acids and their proximate amino acids.
To continue the usage of k-gram feature representation and to enhance classification accuracy, we utilize MMI [28] for deeply extracting conjoint information of amino acids in protein sequences.

Classifying amino acids
The protein-protein interaction can be dominated by dipoles and volumes of diverse amino acids, which reflect electrostatic and hydrophobic properties. All 20 standard amino acid types are assigned to seven functional groups [13], as shown in Table 1. For each pair of proteins, we extract conjoint information based on these amino acid categories.
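The encoding step can be illustrated with a short sketch. It assumes the seven-group partition used by the conjoint triad method of Shen et al. [13]; the numeric index given to each group, the name `encode`, and the choice to skip non-standard residue codes are assumptions made for illustration and may differ from Table 1.

```python
# Seven functional groups following the conjoint triad partition [13].
# The index assigned to each group (0..6) is an assumption for illustration.
GROUPS = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
AA_TO_GROUP = {aa: idx for idx, members in enumerate(GROUPS) for aa in members}


def encode(sequence):
    """Translate an amino acid sequence into a list of group indices.

    Residue codes outside the 20 standard amino acids are skipped
    (an assumption; the text does not say how they are handled).
    """
    return [AA_TO_GROUP[aa] for aa in sequence.upper() if aa in AA_TO_GROUP]
```

For instance, `encode("AGVC")` maps the first three residues to the same group and cysteine to its own group.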

Calculating multivariate mutual information
Considering the neighbours of each amino acid, we regard any three contiguous amino acids as a unit. We use a sliding window of length 3 to parse the protein sequence. For each window, the categories of the three amino acids are used to label the type of this unit. Instead of considering the order of the three amino acids, we only consider the composition of the unit. We define the different types of 3-gram feature representation as $\{C_0, C_0, C_0\}, \{C_0, C_0, C_1\}, \ldots, \{C_6, C_6, C_6\}$. Similarly, we define the different types of 2-gram feature representation as $\{C_0, C_0\}, \{C_0, C_1\}, \ldots, \{C_6, C_6\}$. We count each type of 3-gram feature and 2-gram feature on one protein sequence by a sliding window, as shown in Fig. 1. Throughout the following discussion of mutual information, logarithms are taken to base e. In contrast to the standard mutual information approach, which refers to all possible events, our mutual information and entropy refer to a single event on one protein sequence. We calculate the multivariate mutual information for each type of 3-gram feature, defined as follows:

$$I(a, b, c) = I(a, b) - I(a, b|c)$$

where a, b and c are the categories of the three conjoint amino acids in one unit. We then define the mutual information for one type of 2-gram feature as I(a, b), which can be counted by a sliding window of length 2:

$$I(a, b) = f(a, b) \ln \frac{f(a, b)}{f(a) f(b)}$$

where f(a, b) is the frequency of categories a and b appearing in a 2-gram feature on a protein, and f(a) is the frequency of category a appearing on a protein. In addition, we define the conditional mutual information I(a, b|c) as

$$I(a, b|c) = H(a|c) - H(a|b, c)$$

where H(a|c) and H(a|b, c) are the conditional entropies

$$H(a|c) = -f(a|c) \ln f(a|c)$$

and

$$H(a|b, c) = -f(a|b, c) \ln f(a|b, c)$$

where f(a|c) is the frequency of category a appearing while category c exists in a 2-gram feature on a protein, and f(a|b, c) is the frequency of category a appearing while categories b and c exist in a 3-gram feature on a protein.
H(a|c) and H(a|b, c) can be approximately calculated as follows:

$$H(a|c) = -\frac{f(a, c)}{f(c)} \ln \frac{f(a, c)}{f(c)}$$

and

$$H(a|b, c) = -\frac{f(a, b, c)}{f(b, c)} \ln \frac{f(a, b, c)}{f(b, c)}$$

where f(a, b, c) is the frequency of categories a, b and c appearing in a 3-gram feature on a protein.
To avoid infinite values of I(a, b, c) and I(a, b), we calculate the frequency as follows:

$$f(a) = \frac{n_a + 1}{L + 1}$$

where $n_a$ is the number of occurrences of category a on a protein and L is the length of this protein sequence. We use similar formulas to calculate f(a, b) and f(a, b, c). We obtain 84 multivariate mutual information values I(a, b, c) (3-tuples MI) and 28 mutual information values I(a, b) (2-tuples MI) from one protein. We also compute the frequency of the seven amino acid categories appearing on this protein. A protein sequence is therefore represented by 84 + 28 + 7 = 119 features. Finally, we combine the descriptors of two proteins to build a 238-dimensional vector representing each pair of proteins.
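The per-protein MMI descriptor can be sketched as follows. The sketch assumes I(a, b, c) = I(a, b) − I(a, b|c), I(a, b) = f(a, b) ln(f(a, b) / (f(a) f(b))), I(a, b|c) = H(a|c) − H(a|b, c) with each conditional entropy taken as −p ln p for the ratio of the corresponding joint frequencies, and the smoothed frequency f = (n + 1) / (L + 1). The group index order, the use of sorted windows to represent unordered k-grams, the unsmoothed category frequencies, the feature ordering and all function names are assumptions made for illustration, not the authors' implementation.

```python
import math
from collections import Counter
from itertools import combinations_with_replacement

# Seven groups as in the conjoint triad method [13]; index order is assumed.
GROUPS = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
AA_TO_GROUP = {aa: g for g, members in enumerate(GROUPS) for aa in members}


def mmi_features(sequence):
    """Return 84 three-tuple MI + 28 two-tuple MI + 7 category frequencies."""
    codes = [AA_TO_GROUP[aa] for aa in sequence.upper() if aa in AA_TO_GROUP]
    length = len(codes)
    if length < 3:
        raise ValueError("need at least three standard residues")

    # Occurrence counts; windows are sorted because order is ignored.
    n1 = Counter(codes)
    n2 = Counter(tuple(sorted(codes[i:i + 2])) for i in range(length - 1))
    n3 = Counter(tuple(sorted(codes[i:i + 3])) for i in range(length - 2))

    def freq(counter, *cats):
        # Smoothed frequency (n + 1) / (L + 1) keeps every logarithm finite.
        key = cats[0] if len(cats) == 1 else tuple(sorted(cats))
        return (counter[key] + 1) / (length + 1)

    def entropy_term(p):
        return -p * math.log(p)

    def mi2(a, b):
        fab = freq(n2, a, b)
        return fab * math.log(fab / (freq(n1, a) * freq(n1, b)))

    def mi3(a, b, c):
        h_a_c = entropy_term(freq(n2, a, c) / freq(n1, c))
        h_a_bc = entropy_term(freq(n3, a, b, c) / freq(n2, b, c))
        return mi2(a, b) - (h_a_c - h_a_bc)

    triples = combinations_with_replacement(range(7), 3)  # 84 types
    pairs = combinations_with_replacement(range(7), 2)    # 28 types
    return ([mi3(*t) for t in triples]
            + [mi2(*p) for p in pairs]
            + [n1[g] / length for g in range(7)])


def pair_features(seq_a, seq_b):
    """Concatenate the descriptors of two proteins (2 x 119 = 238 values)."""
    return mmi_features(seq_a) + mmi_features(seq_b)
```

Because unordered 3-grams over seven categories are multisets, `combinations_with_replacement` enumerates exactly the 84 and 28 types counted in the text.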

Normalized Moreau-Broto autocorrelation
It is well known that PPIs include four interaction modes, usually expressed as electrostatic interaction, hydrophobic interaction, steric interaction and hydrogen bond. Feng et al. [29] introduced an autocorrelation function combining physicochemical properties of amino acids to propose a feature representation method, which is used to predict the types of membrane proteins. Inspired by this method, we use the NMBAC to extract features from protein sequences.

Six physicochemical properties of amino acids
The physicochemical properties we consider are hydrophobicity (H), volume of side chains (VSC), polarity (P1), polarizability (P2), solvent-accessible surface area (SASA) and net charge index of side chains (NCISC).
Values of these six physicochemical properties for each amino acid are listed in Table 2 [30]. They are first normalized to zero mean and unit standard deviation (SD) as follows:

$$P'_{i,j} = \frac{P_{i,j} - \bar{P}_j}{S_j}$$

where $P_{i,j}$ is the value of descriptor j for amino acid type i, $\bar{P}_j$ is the mean of descriptor j over the 20 amino acids, and $S_j$ is the corresponding SD. Each protein can be translated into six vectors, with each amino acid represented by the normalized values of the six descriptors. NMBAC [29] can then be computed as follows:

$$NMBAC(lag, j) = \frac{1}{n - lag} \sum_{i=1}^{n - lag} X_{i,j} \, X_{i+lag,j}$$

where j denotes one of the six descriptors, i is the position in protein sequence X, $X_{i,j}$ is the normalized value of descriptor j for the residue at position i, n is the length of the protein sequence, and lag is the sequential distance between one residue and another a certain number of residues away (lag = 1, 2, . . . , lg), where lg is a parameter. Following AC [12], we let lag range from 1 to 30, which yields a 30 × 6 = 180-dimensional vector. We also compute the frequency of the 20 amino acids appearing on this sequence. As a result, a protein sequence is represented by 30 × 6 + 20 = 200 features. Finally, we combine the descriptors of two proteins and build a 400-dimensional vector to represent each pair of proteins by NMBAC.
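A minimal sketch of the NMBAC descriptor follows, assuming that the autocorrelation at a given lag is the mean over positions i of the product of the normalized descriptor values at positions i and i + lag. The property table in the sketch holds random placeholder numbers, not the values of Table 2; the population SD, the lag-major feature ordering and all names are assumptions made for illustration.

```python
import random
from statistics import mean, pstdev

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def normalize(table):
    """Z-score each descriptor (column) across the 20 amino acids."""
    columns = list(zip(*table.values()))
    mu = [mean(col) for col in columns]
    sd = [pstdev(col) for col in columns]  # population SD (assumed)
    return {aa: [(v - m) / s for v, m, s in zip(values, mu, sd)]
            for aa, values in table.items()}


def nmbac_features(sequence, norm_table, max_lag=30):
    """Return max_lag x (number of descriptors) autocorrelations + 20 frequencies."""
    seq = [aa for aa in sequence.upper() if aa in norm_table]
    n = len(seq)
    if n <= max_lag:
        raise ValueError("sequence must be longer than max_lag")
    x = [norm_table[aa] for aa in seq]
    n_desc = len(x[0])
    feats = []
    for lag in range(1, max_lag + 1):
        for j in range(n_desc):
            total = sum(x[i][j] * x[i + lag][j] for i in range(n - lag))
            feats.append(total / (n - lag))
    # Amino acid composition of the sequence.
    feats += [seq.count(aa) / n for aa in AMINO_ACIDS]
    return feats


# Placeholder values only: substitute the six real properties from Table 2.
_rng = random.Random(0)
PLACEHOLDER_TABLE = {aa: [_rng.gauss(0, 1) for _ in range(6)] for aa in AMINO_ACIDS}
```

With six descriptors and `max_lag=30`, the function returns 180 + 20 = 200 values per protein, so concatenating two proteins gives the 400-dimensional pair vector.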

Random forest classifier
RF is a classification algorithm developed by Leo Breiman [31] that uses an ensemble of classification trees. Each classification tree is built using a bootstrap sample of the training data, and at each split the candidate set is a random subset of the variables. RF thus uses both bagging (bootstrap aggregation) and random variable selection for tree building. Each classification tree is left unpruned to obtain low-bias trees, while bagging and random variable selection keep the correlation between individual trees low. As a result, RF performs well in classification tasks. In this paper, the feature space of each pair of proteins is composed of MMI and NMBAC. In total, 238 + 400 = 638 features are encoded to represent each pair of proteins. We define a 638-dimensional feature vector $F = (x_1, x_2, \ldots, x_{638})$ as the input of the RF model. The class label t of an interacting pair or a non-interacting pair is set to 1 or −1, respectively. If the number of cases in the training set is N, a sample is built by randomly choosing N cases from the original data with replacement. This sample is the training set for growing the tree. If there are M input variables, a number $m \ll M$ is specified such that, at each node, m variables are selected at random out of the M and the best split on these m variables is used to split the node. The value of m is held constant while the forest is grown. Each tree is grown to the largest extent possible without pruning. For a new test sample, the classification result is obtained by voting over these trees.
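The tree-growing procedure described above can be sketched in a few lines of pure Python. This is a deliberately small illustration and not the implementation used in the paper: it scores splits with Gini impurity (an assumption, since the split criterion is not stated here), falls back to a majority-vote leaf when none of the m sampled variables can split a node, and all function names are hypothetical.

```python
import random
from collections import Counter


def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels, candidate_vars):
    """Best (impurity, variable, threshold) among the candidate variables."""
    best = None
    for v in candidate_vars:
        for t in sorted({r[v] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[v] <= t]
            right = [y for r, y in zip(rows, labels) if r[v] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, v, t)
    return best


def grow_tree(rows, labels, m, rng):
    """Grow an unpruned tree; m random variables are tried at every node."""
    if len(set(labels)) == 1:
        return labels[0]
    split = best_split(rows, labels, rng.sample(range(len(rows[0])), m))
    if split is None:  # the sampled variables cannot separate this node
        return Counter(labels).most_common(1)[0][0]
    _, v, t = split
    left = [i for i, r in enumerate(rows) if r[v] <= t]
    right = [i for i, r in enumerate(rows) if r[v] > t]
    return (v, t,
            grow_tree([rows[i] for i in left], [labels[i] for i in left], m, rng),
            grow_tree([rows[i] for i in right], [labels[i] for i in right], m, rng))


def predict_tree(node, row):
    while isinstance(node, tuple):
        v, t, left, right = node
        node = left if row[v] <= t else right
    return node


def grow_forest(rows, labels, n_trees, m, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        # Bootstrap sample: N cases drawn with replacement.
        idx = [rng.randrange(len(rows)) for _ in rows]
        forest.append(grow_tree([rows[i] for i in idx],
                                [labels[i] for i in idx], m, rng))
    return forest


def predict_forest(forest, row):
    """Majority vote over all trees."""
    return Counter(predict_tree(tree, row) for tree in forest).most_common(1)[0][0]
```

For data of realistic size, an optimized library implementation would be the practical choice; the sketch only makes the bootstrap, random variable selection and voting steps explicit.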

Results
We test our method on several different PPIs datasets to evaluate the performance of our proposed approach, including the S.cerevisiae, H.pylori, human 8161, C.elegans, E.coli, human 1412 and M.musculus datasets. First, we independently analyze the performance of the two protein representations, MMI and NMBAC. Second, we compare our method with some outstanding methods on the S.cerevisiae, H.pylori and human 8161 datasets. Because our proposed method achieves high performance on these three datasets, we then use the S.cerevisiae PPIs dataset to construct a model and evaluate its prediction performance on five independent testing datasets from other species. Our experiments suggest that experimentally identified interactions in one organism are able to predict interactions in other organisms. We also test our method on two special yeast and human PPIs datasets. In addition, we test our method on three important PPIs networks, and compare it with the state-of-the-art methods. We use our primary experimental information to predict real PPIs networks, which are assembled from pairwise PPIs data.

PPIs datasets
The first PPIs dataset, described by You et al. [16], is downloaded from the yeast S.cerevisiae core subset in the Database of Interacting Proteins (DIP) [11]. Proteins with fewer than 50 residues or with more than 40 percent sequence identity are removed, and the remaining 5594 pairs of proteins form the golden standard positive dataset (GSP). Non-interacting pairs are selected uniformly at random from the set of all protein pairs that are not known to interact. Pairs whose proteins share the same subcellular localization are then excluded. Finally, the golden standard negative dataset (GSN) consists of 5594 protein pairs whose subcellular localizations are different. The GSP and GSN datasets contain a total of 11188 protein pairs (half from the positive dataset and half from the negative dataset).
The second PPIs dataset, described by Martin et al. [32], is composed of 2916 H.pylori protein pairs (1458 interacting pairs and 1458 non-interacting pairs). The third PPIs dataset is collected from the Human Protein References Database (HPRD) as described by Huang et al. [19]. Huang  [14]. These species-specific PPIs datasets are employed in our experiment to verify the effectiveness of our proposed method.

Evaluation measurements
To test the robustness of our method, we repeat the process of random selection of the training and test sets, model-building and model-evaluating. This process is five-fold cross validation. We report seven measures: overall prediction accuracy (ACC), sensitivity (SN), specificity (Spec), positive predictive value (PPV), negative predictive value (NPV), weighted average of the PPV and sensitivity (F score), and Matthew's correlation coefficient (MCC). These measures are defined as follows:

$$ACC = \frac{TP + TN}{TP + FP + TN + FN}$$

$$SN = \frac{TP}{TP + FN}$$

$$Spec = \frac{TN}{TN + FP}$$

$$PPV = \frac{TP}{TP + FP}$$

$$NPV = \frac{TN}{TN + FN}$$

$$F = \frac{2 \times SN \times PPV}{SN + PPV}$$

$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FN)(TN + FP)(TP + FP)(TN + FN)}}$$

where true positive (TP) is the number of true PPIs that are predicted correctly; false negative (FN) is the number of true PPIs that are predicted to be non-interacting pairs; false positive (FP) is the number of true non-interacting pairs that are predicted to be PPIs; and true negative (TN) is the number of true non-interacting pairs that are predicted correctly.
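The seven measures can be computed directly from the four confusion-matrix counts. The sketch below uses the standard definitions of these measures; the function name `evaluate` and the dictionary keys are assumptions made for illustration.

```python
import math


def evaluate(tp, fn, fp, tn):
    """Compute the seven evaluation measures from confusion-matrix counts."""
    sn = tp / (tp + fn)    # sensitivity
    ppv = tp / (tp + fp)   # positive predictive value
    return {
        "ACC": (tp + tn) / (tp + fn + fp + tn),
        "SN": sn,
        "Spec": tn / (tn + fp),
        "PPV": ppv,
        "NPV": tn / (tn + fn),
        "F": 2 * sn * ppv / (sn + ppv),
        "MCC": (tp * tn - fp * fn)
               / math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn)),
    }
```

For example, `evaluate(tp=90, fn=10, fp=5, tn=95)` describes a test fold of 200 pairs in which 185 are classified correctly.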

Experimental environment
In this paper, our proposed sequence-based PPIs predictor is implemented using C++ and MATLAB. All experiments are carried out on a computer with a 2.5 GHz 6-core CPU, 32 GB memory and the Windows operating system. The two RF parameters, the number of decision trees and the number of variables tried at each split, are set to 500 and 25, respectively.

Performance of PPIs prediction
We use eight different datasets to evaluate the performance of our proposed method. The proposed approach is compared with other methods on the S.cerevisiae, H.pylori and human 8161 datasets. Then, we test our method on the human 1412 , M.musculus, H.pylori, C.elegans, and E.coli datasets for PPIs prediction.

S.cerevisiae dataset
We use the first PPIs dataset used in You et al. [16] to evaluate the performance of our model.

Analyzing 2-tuples and 3-tuples MI
We analyze the performance of the 2-tuples and 3-tuples MI features on the S.cerevisiae dataset. The prediction results for the 2-tuples and 3-tuples MI are shown in Table 3.
The accuracies for 2-tuples MI, 3-tuples MI and MMI are 93.56, 93.88 and 94.23 %, respectively. The combined MMI representation thus achieves better performance than either 2-tuples MI or 3-tuples MI alone.

Analyzing MMI and NMBAC
In order to understand the contribution of the different feature representations, we evaluate the performance of MMI and NMBAC for PPIs prediction. We use the S.cerevisiae dataset, which is randomly partitioned into training and independent testing sets via five-fold cross validation. Each of the five subsets acts as an independent holdout testing dataset for the model trained with the remaining four subsets. Cross validation minimizes the impact of data dependency and improves the reliability of the experimental results. The prediction result is shown in
To account for the asymmetry of protein pairs, the forward vector of one PPI is composed of the two interacting proteins in the order (protein A, protein B), and the backward vector is composed of the same two proteins in reverse order (protein B, protein A). The accuracies on forward and backward vectors for PPIs prediction are 95.01 and 94.90 %, so the prediction result changes little.

5-fold cross-validation
The prediction result of our method on S.cerevisiae dataset is shown in Table 5. We predict PPIs of S.cerevisiae dataset, and obtain accuracy, precision, sensitivity, and MCC of 95.01, 97.31, 92.67, and 90.1 %, respectively. Standard deviations of these criteria values are 0.46, 0.61, 0.5, and 0.92 %, respectively. High accuracies and low standard deviations of these criterion values show that our proposed model is effective and stable for predicting PPIs.
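The five-fold partition behind these figures can be sketched as follows. The shuffling seed, the interleaved assignment of samples to folds and the name `five_fold_indices` are assumptions made for illustration; the text does not specify how the folds were drawn.

```python
import random


def five_fold_indices(n_samples, k=5, seed=0):
    """Yield (train, test) index lists for k-fold cross validation."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k nearly equal-sized folds
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test
```

With the 11188 pairs of the S.cerevisiae dataset, each fold holds 2237 or 2238 pairs, and every pair is used for testing exactly once.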

Comparison with existing methods
We compare the prediction performance of our proposed method with other existing methods on the S.cerevisiae dataset, as shown in Table 6. Our proposed model obtains a high prediction accuracy of 95.01 %. We use the same S.cerevisiae PPIs dataset, and compare our experimental result with the methods proposed by You et al. [16,18,30], Wong et al. [17], Guo et al. [12], Zhou et al. [14] and Yang et al. [15].

H.pylori dataset
In order to highlight the advantage of our method, we also test it on the H.pylori dataset, which is described by Martin et al. [32]. We compare the prediction performance of our proposed method with other previous works including AC+CT+LD+MAC [30], MCD [16], DCT+SMR [19], phylogenetic bootstrap [33], signature products [32], HKNN [24], ensemble of HKNN [25] and boosting. In Table 7, we can see that the average prediction performance of our method, such as sensitivity, PPV,
We also test our method on a human 8161 dataset, which is used by Huang et al. [19]. We compare the prediction performance between our proposed method and Huang's work [19] on this dataset, as shown in

PPIs identification on independent across species dataset
If a large number of physically interacting proteins in one organism have a "co-evolved" relationship, their [14,18,19]. Overall, the accuracy of the ensemble representation is 2.79 % higher than that of the single representations (MMI and NMBAC) on these five independent species.

Two special PPIs datasets
Yungki Park and Edward M. Marcotte [20] proposed two PPIs datasets, one for yeast and one for human, to evaluate pair-input computational predictions. We compare the performance of our method with seven methods (M1-M7) of pair-input computational prediction on the two PPIs datasets: M1, a signature products-based method proposed by Martin et al. [32] and classified by SVM; M2, in which a protein sequence is described as in M1, and the feature vector for a protein pair is formed by applying the metric learning pairwise kernel and classified by SVM; M3, the SVM-based method with the CT feature developed by Shen et al. [13]; M4, the SVM-based method with the AC feature developed by Guo et al. [12]; M5, in which the PPIs feature is the same as in M4 and the classifier is the random forest; M6, a method developed by Pitre et al. [34]; M7, a method originally developed for protein-RNA interaction prediction [35]. We report the typical cross-validated (CV) predictive performance and the performance on three distinct test classes (C1, C2, C3). The performance of each method is summarized as the average area under the receiver operating characteristic curve (AUROC) ± its standard deviation and the corresponding average area under the precision-recall curve (AUPRC) ± its standard deviation.
Prediction results are shown in Tables 10 and 11 Table 12 and Table 13. On new yeast PPIs dataset, our method achieves 0.65, 0.66, 0.60

PPIs networks prediction
A useful application of a PPIs prediction method is the prediction of PPIs networks. Our method predicts three important PPIs networks assembled from pairwise PPIs. The one-core network of CD9, an important tetraspanin protein, is the simplest network [21]. Our method identifies 14 of the 16 PPIs in this network, an accuracy of 87.50 %. Compared with Shen's work [13], the accuracy of our method is increased by 6.25 %. The results are shown in Fig. 3, where dark blue lines are true predictions and red lines are false predictions.
The Ras-Raf-Mek-Erk-Elk-Srf pathway is a multiple-core network that has been implicated in a variety of cellular processes [22]. There are 189 PPIs in this network, of which 174 are predicted correctly by our method. Compared with Shen's work, the accuracy is increased by 2.06 %. The prediction result and the Ras-Raf-Mek-Erk-Elk-Srf pathway are shown in Fig. 4, where dark blue lines are true predictions and red lines are false predictions.
The Wnt-related network is a typical crossover network, and its related pathway is essential in signal transduction. Ulrich et al. [23] have demonstrated the protein interaction topology of the Wnt-related network. Shen et al. [13] tested their method on this network: there are 96 PPIs in the network, and 73 are predicted correctly by their method, an accuracy of 76.04 %. We also predict PPIs in the Wnt-related network. Our method discovers 91 of the 96 PPIs in the network, an accuracy of 94.79 %, which is better than that of Shen's method [13]. The prediction result and the Wnt-related network are shown in Fig. 5, where dark blue lines are true predictions and red lines are false predictions.
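The network accuracies quoted above follow directly from the reported counts, as the short check below shows. The helper name `accuracy_pct` is an assumption made for illustration; only counts stated in the text are used.

```python
def accuracy_pct(correct, total):
    """Percentage of correctly predicted PPIs, rounded to two decimals."""
    return round(100 * correct / total, 2)


cd9 = accuracy_pct(14, 16)      # one-core network (CD9)
ras = accuracy_pct(174, 189)    # multiple-core network (Ras-Raf-Mek-Erk-Elk-Srf)
wnt = accuracy_pct(91, 96)      # crossover network (Wnt-related), our method
wnt_ct = accuracy_pct(73, 96)   # crossover network, method of Shen et al. [13]
wnt_gain = round(wnt - wnt_ct, 2)
```

The difference between the two Wnt-related accuracies reproduces the 18.75 % improvement over CT reported for the crossover network.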

Discussion
Although many computational methods have been used to predict PPIs, the effectiveness of previous prediction models can still be improved. Existing methods that fail to take into account local amino acid environments are neither reliable nor robust. Therefore, building on the Conjoint Triad idea, we propose a representation that accounts for the properties of each amino acid together with its two vicinal amino acids. We use one PPIs dataset to construct a model to predict the PPIs datasets of five other independent species. The results indicate that the proposed model can be successfully applied to other species for which experimental PPIs data is not available. It should be noted that the biological hypothesis underlying the mapping of PPIs from one species to another is that large numbers of physically interacting proteins in one organism are co-evolved.
Fig. 3 A one-core network for the CD9 network
The most useful application of a PPIs prediction method is the prediction of PPIs networks, and predicting such networks accurately is the most important issue for these methods. We extend our method to predict three important real PPIs networks: a one-core network, a multiple-core network and a crossover network. General PPIs networks are crossover networks, so our method is useful in practical applications. All these results demonstrate that our proposed method is a promising and useful support tool for future proteomics research. The main improvements of the proposed method come from adopting an effective feature extraction method that can capture useful protein sequence information. In future work, we will extend our method to predict other important PPIs networks.

Conclusions
In this paper, we develop a new method for predicting PPIs using the primary sequences of two proteins. The prediction model is constructed based on random forest and an ensemble feature representation scheme. In addition, we use MMI to improve the performance in predicting PPIs. For the performance evaluation, our method is applied to the S.cerevisiae PPIs dataset. The prediction result shows that our method achieves 95.01 % accuracy and 92.67 % sensitivity. To further demonstrate the effectiveness of our method, we also use the H.pylori PPIs dataset. Our method achieves 87.59 % accuracy and 86.81 % sensitivity. On the human 8161 dataset, the experimental result shows that our method achieves 97.56 % accuracy and 96.57 % sensitivity. We use the S.cerevisiae PPIs dataset to construct a model to predict the PPIs datasets of five other independent species. Our proposed method achieves 92.80, 92.16 on E.coli, C.elegans, human 1412, H.pylori and M.musculus datasets, respectively. We extend our method to predict three important real PPIs networks, and the accuracy of our method is increased by 6.25, 2.06 and 18.75 % compared with CT. The prediction ability of our approach is better than that of other existing PPIs prediction methods.