Protein-protein interaction prediction using a hybrid feature representation and a stacked generalization scheme

Background: Although various machine learning-based predictors have been developed for estimating protein–protein interactions, their performances vary with dataset and species, and are affected by two primary aspects: choice of learning algorithm, and the representation of protein pairs. To improve the performance of predicting protein–protein interactions, we exploit the synergy of multiple learning algorithms, and utilize the expressiveness of different protein-pair features.

Results: We developed a stacked generalization scheme that integrates five learning algorithms. We also designed three types of protein-pair features based on the physicochemical properties of amino acids, gene ontology annotations, and interaction network topologies. When tested on 19 published datasets collected from eight species, the proposed approach achieved a significantly higher or comparable overall performance, compared with seven competitive predictors.

Conclusion: We introduced an ensemble learning approach for PPI prediction that integrated multiple learning algorithms and different protein-pair representations. The extensive comparisons with other state-of-the-art prediction tools demonstrated the feasibility and superiority of the proposed method.


Background
Cells are predominantly composed of proteins, and almost every primary cellular process is performed by multiprotein complexes. By identifying and analyzing the components of protein complexes, we can better understand how protein ensembles are organized into functional units [1]. As protein-protein interactions (PPIs) are crucial to most cellular functions, identifying them is essential for deciphering cellular behaviors. In the past few decades, large-scale PPI analysis has been enabled by techniques such as yeast two-hybrid (Y2H) systems [2], mass spectrometry [3], and protein chips [4]. However, these methods are time-consuming and expensive, and large-scale experiments usually suffer from high false positive rates [5]. Computational techniques, in contrast, can identify potential PPIs that are not discoverable by high-throughput methods. The computational predictions can then be verified by more labor-intensive methods.
Researchers have proposed different types of computational approaches based on different sources of biological information. For example, several methods can predict PPIs from protein sequences. SPRINT evaluates the likelihood of interactions by assessing the contributions of similar sequence motifs [6]. Huang et al. [7] translated protein sequences into feature vectors of composition and transition descriptors, and predicted the PPIs using a weighted sparse representation-based classifier. Guo et al. [8] combined a support vector machine (SVM) with auto covariance that predicts PPIs from protein sequences. Other methods utilize the genomic, proteomic, and/or structural information of proteins [9,10]. In recent years, semantic similarity has been applied to ontology, providing a valuable indicator of the relatedness level between two biological entities [11]. Observationally, proteins will likely interact when localized in the same cellular component, or when sharing a common biological process or molecular function. Accordingly, various methods infer PPIs from the gene ontology (GO) annotations and semantic similarity of proteins [12][13][14]. Other methods integrate semantic similarity with machine learning (ML) algorithms. For example, Ben-Hur and Noble [15], Bandyopadhyay and Mallick [16], and Armean et al. [17] combined GO annotations with SVM for PPI prediction. Other ML algorithms employed in PPI prediction include Bayesian classifiers [18] and random forest (RF) [19]. In addition, deep learning has recently been applied for PPI prediction. Sun et al. [20] used stacked autoencoders in their network architecture, Du et al. [21] adopted two separate deep neural networks to process the characteristics of each protein in a protein pair, and Gonzalez-Lopez et al. [22] introduced a deep recurrent neural network combined with the embedding techniques. These computational methods differ in their feature representations and algorithmic processes. 
Different ML approaches have distinctive inherent biases, including representation biases and process biases, which affect their learning behaviors and performances significantly even in the same learning task [23].
In this study, we propose a hybrid feature representation that combines protein sequence properties, gene ontology information, and interaction network topology. To reflect the characteristics of amino acids, we encode their various physicochemical properties (such as hydrophobicity, hydrophilicity, polarity, and solvent-accessible surface area) into the sequence-based features. To learn the knowledge organized in the directed acyclic graph (DAG) of GO, we develop the GO-based features by clustering the GO terms according to a partitioning of the GO DAG that is driven by the provided training data. Treating PPI prediction as a network reconstruction problem, we construct a partial network from the training data and extract its topological properties as the network-based features. We adopt a stacked generalization scheme [24] and develop a classifier called PPI-MetaGO, which improves PPI prediction by deducing the biases of the base generalizers and exploiting the synergy among various ML algorithms.
PPI-MetaGO was evaluated in consistent and unbiased tests on the datasets used in previous evaluations of state-of-the-art PPI prediction methods. The experimental results demonstrate the superior performance of PPI-MetaGO over several established PPI-prediction approaches.

Methods
This section describes our proposed ensemble supervised meta-learner, PPI-MetaGO, for PPI prediction. The protein pairs for training the ensemble meta-learner are represented as feature vectors constructed from the sequence-based physicochemical properties and the GO-based semantic similarities. PPI-MetaGO is implemented as illustrated in Fig. 1.

Feature extraction: sequence-based physicochemical features
As the basis for PPI prediction, we characterize proteins by 12 physicochemical properties of their constituent amino acids [25][26][27][28][29][30][31][32], namely, hydrophilicity, flexibility, accessibility, turns scale, exposed surface, polarity, antigenic propensity, hydrophobicity, net charge index of the side chains, polarizability, solvent-accessible surface area, and side-chain volume. Among the 12 properties, hydrophobicity and polarity are each calculated according to two different scales, giving 14 scales in total. The values of the 14 physicochemical property scales for the 20 standard amino acids are listed in Table 1. We translated each amino acid into a vector of 14 numeric values, each corresponding to a physicochemical scale value in Table 1.
As an example, Fig. 2(a) shows the transformation of two proteins, P_1 and P_2, into sequences of 14-element vectors, one vector per amino acid. Each element in each vector corresponds to a physicochemical scale value [20,33].
As proteins vary widely in length, different proteins are represented by different numbers of vectors. However, the base classifiers in an ensemble meta-learner, such as an artificial neural network (ANN), k-nearest-neighbor (KNN) or naïve Bayesian (NB) classifier, require a uniform input. For example, in Fig. 2(a), protein P_1, composed of five amino acids, is represented by five vectors, whereas protein P_2, with three amino acids, is described by three vectors. To prepare a uniform input for the base classifiers of the ensemble meta-learner, we transform the protein representation (a variable number of numeric vectors) into a fixed-length vector using auto covariance [8,34]. The auto covariance (AC) of a physicochemical property scale along a protein sequence describes the average interaction between the amino acids separated by a certain gap throughout the entire protein sequence. Here, the gap is set as a certain number of residues between an amino acid and its neighbor. The AC of the ith physicochemical property scale, AC_{i,g}, is given by

$$AC_{i,g} = \frac{1}{L-g} \sum_{j=1}^{L-g} (X_{i,j} - \mu_i)(X_{i,j+g} - \mu_i), \qquad \mu_i = \frac{1}{L} \sum_{j=1}^{L} X_{i,j}$$

where g is the pre-specified gap, L is the length of protein P, X_{i,j} is the value of the ith physicochemical scale at the jth residue of P, and μ_i is the mean of the ith physicochemical scale values of protein P. Setting the maximum gap to G (i.e. g = 1, 2, 3, …, G), we can represent any protein, regardless of length, as a vector of k × G AC variables, where k is the number of physicochemical property scales. For example, when G is set to 2 and there are 14 physicochemical scales, the numeric vectors of proteins P_1 and P_2 in Fig. 2(a) are transformed into the uniform AC vectorial form shown in Fig. 2(b). Proteins P_1 and P_2 are each represented by a vector of 28 AC values, even though they have different lengths.
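The auto covariance transformation can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes the common 1/(L − g) normalization used for AC descriptors, and the function names (`auto_covariance`, `ac_vector`) and the two toy scales over a three-letter alphabet are hypothetical stand-ins for the 14 scales of Table 1.

```python
from statistics import fmean

def auto_covariance(scale_values, max_gap):
    """Return [AC_1, ..., AC_G] for one physicochemical scale along one protein."""
    length = len(scale_values)
    if length <= max_gap:
        raise ValueError("sequence must be longer than the maximum gap")
    mu = fmean(scale_values)                      # mean of the scale over the protein
    centered = [x - mu for x in scale_values]
    return [
        sum(centered[j] * centered[j + g] for j in range(length - g)) / (length - g)
        for g in range(1, max_gap + 1)
    ]

def ac_vector(sequence, scales, max_gap):
    """Concatenate the AC values of every scale into one vector of k * G values."""
    features = []
    for scale in scales:                          # each scale maps residue letter -> value
        features.extend(auto_covariance([scale[aa] for aa in sequence], max_gap))
    return features

# Two made-up scales (illustrative values only, not taken from Table 1).
toy_scales = [
    {"A": 1.0, "C": 2.0, "D": 4.0},
    {"A": 0.5, "C": -0.5, "D": 0.0},
]
p1 = ac_vector("ACDAC", toy_scales, max_gap=2)    # five residues
p2 = ac_vector("DCA", toy_scales, max_gap=2)      # three residues
print(len(p1), len(p2))                           # both vectors have k * G = 4 entries
```

With the 14 scales of Table 1 and G = 2, the same routine would yield the 28-value vectors described above, independent of protein length.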
To avoid the effects of variance, we first normalize each AC to zero mean and unit standard deviation as follows:

$$S_i = \frac{A_i - \mu_i}{SD_i}, \qquad i = 1, 2, \ldots, M$$

where S_i is the standardized value, A_i is the raw value of the ith AC, μ_i and SD_i denote the mean and standard deviation of the ith AC, respectively, and M is the number of AC values in the AC vector. Second, to ensure that the ACs derived from different physicochemical scales are commensurate and to further suppress the effects of outliers, we adopt a min-max scaling method that scales the standardized AC values to the fixed range [0, 1]. The min-max scaling is described by Eq. (4):

$$V_i = \frac{S_i - min_i}{MAX_i - min_i}, \qquad i = 1, 2, \ldots, M \qquad (4)$$

where V_i is the scaled value, S_i is the standardized value of the ith AC, MAX_i and min_i are the maximum and minimum of the standardized values of the ith AC, respectively, and M is defined above.

Fig. 2 Vectorial representations of two proteins, P_1 and P_2. (a) Each amino acid AA_i is first translated into a vector of 14 physicochemical scale values. (b) Both proteins, P_1 and P_2, are later represented in a uniform vectorial form with 28 AC values. The figure demonstrates the calculation of the first two AC values of H_11 for P_2 when the gap is 1 (g = 1) and 2 (g = 2), respectively.

With two proteins represented by two AC vectors, the protein pair (P_1, P_2) can be represented in one of two common forms: (1) combination [V(P_1) ⊕ V(P_2)], or (2) concatenation [V(P_1)V(P_2)]. Here, V(P) is the sequence-based feature vector corresponding to protein P, and the ⊕ operator adds the feature values of the two proteins in element-by-element fashion [16]. In our approach, the feature vectors of the two proteins are concatenated. Concatenation avoids the need to apply a direct pairwise kernel on the feature space of protein pairs [16], which involves a complex kernel design, or to apply specific binary operators such as addition or multiplication to each pair of elements, which introduces uncertain effects. However, as mentioned by Bandyopadhyay and Mallick [16], concatenating the protein-pair features is undesirable in PPI prediction because, for the same protein pair P_1 and P_2, [V(P_1)V(P_2)] and [V(P_2)V(P_1)] are represented differently in the feature space. Training a learner on only one of the two representations loses the information of the other. To resolve this order problem, we represent the protein pair (P_1, P_2) by both concatenations, [V(P_1)V(P_2)] and [V(P_2)V(P_1)]. Provided with the concatenations in both orders for training, the learner can flexibly identify the (approximately) optimum decision regions for PPI prediction, based on either [V(P_1)V(P_2)] or [V(P_2)V(P_1)]. To classify a new protein pair (P_3, P_4), we average the predicted class probabilities (interacting and non-interacting) produced by the trained learner for [V(P_3)V(P_4)] and [V(P_4)V(P_3)], and predict the class of the protein pair (P_3, P_4) according to the higher average probability.
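The both-orders strategy can be illustrated with a short Python sketch. It is written under stated assumptions rather than taken from the paper: `toy_model` is a hypothetical, deliberately order-sensitive scorer standing in for any trained learner that returns the probability of the interacting class, and ties at 0.5 are arbitrarily resolved as interacting.

```python
import math

def augment_training(pairs, labels):
    """Emit both concatenation orders of every training pair, duplicating its label."""
    X, y = [], []
    for (v1, v2), label in zip(pairs, labels):
        X.extend([v1 + v2, v2 + v1])
        y.extend([label, label])
    return X, y

def predict_pair(predict_proba, v1, v2):
    """Average P(interacting) over both orders, then choose the likelier class."""
    probs = [predict_proba(v1 + v2), predict_proba(v2 + v1)]
    p_interacting = sum(probs) / len(probs)
    label = "interacting" if p_interacting >= 0.5 else "non-interacting"
    return label, p_interacting

def toy_model(x):
    """Hypothetical stand-in for a trained learner; its output depends on the order."""
    weights = [0.8, -0.2, 0.1, 0.4]
    z = sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))

v3, v4 = [1.0, 0.0], [0.0, 1.0]
X_aug, y_aug = augment_training([(v3, v4)], [1])
label, p = predict_pair(toy_model, v3, v4)
print(label, round(p, 3))
```

Because the two orders are averaged, `predict_pair(toy_model, v3, v4)` and `predict_pair(toy_model, v4, v3)` return the same result even though the underlying model is not symmetric.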

Feature extraction: GO-based features
GO is a hierarchical vocabulary for annotating gene functions and their relationships with respect to their molecular function (MF), cellular components (CC), and biological process (BP) [35]. Each subontology is represented by a rooted DAG, where each node corresponds to a GO-term, and each link denotes a relationship between two terms, such as part_of or is_a. This hierarchical knowledge of the functional relationships between gene products has proved most useful for assessing the relevance of the involvement of genes in various biological activities [36], including PPI prediction [13,16,19].
Interacting proteins often participate in similar biological processes, exercise similar molecular functions, and/or co-localize in similar cellular components; consequently, they exhibit high GO semantic similarity [14,37]. Many measures of semantic similarity have been proposed and categorized into edge-based, node-based and hybrid methods [11]. The edge-based methods are mainly based on counting the edges along the paths between the GO terms being considered [38]. By contrast, the node-based approaches compare the properties of the involved terms, their ancestors, or their descendants [39,40]. One of the most commonly considered properties is the information content of the terms. Node-based measures are typically more reliable than edge-based methods in the biomedical field, because most of the edge-based measures assume that the distance between all relationships in an ontology is constant or depth-dependent. Neither assumption is valid in existing biomedical ontologies. Alternatively, the hybrid methods assign weights to the edges and define the semantic similarity after combining various types of measures, such as node depth, node link density, information content, or semantic contribution of the relationships (e.g. is_a or part_of) [41].
We propose a novel approach that characterizes protein pairs based on the clustering of GO terms. Given two sets G_i and G_j of GO terms annotating the proteins P_i and P_j in a pair, we traverse the GO hierarchy from the GO terms in G_i and G_j up to their lowest common ancestor (ULCA) [19]. In this fashion, we identify the lowest common ancestor (LCA) of each protein pair <P_i, P_j> in a given set of protein pairs. The found LCAs are stored in a list sorted in ascending order of their hierarchical GO level (bottom-up in the example below, so that a deeper LCA is processed before its ancestors). For each LCA in the sorted list, we iteratively group that LCA and all its descendants into a cluster, excluding those already assigned to a previously formed cluster. The entire GO DAG is consequently partitioned into a set of mutually exclusive subgraphs, each rooted at an LCA, as illustrated in Fig. 3.

In the sample hierarchy of Fig. 3, the two protein pairs <P_1, P_2> and <P_5, P_6> share a common LCA (G_11), which is denoted by LCA_3. The LCA of protein pair <P_3, P_4> (G_4) is denoted by LCA_2. The LCAs of protein pairs <P_7, P_8> and <P_9, P_10> (G_15 and G_1, respectively) are denoted by LCA_4 and LCA_1, respectively. These four LCAs are organized into a sorted list L in ascending order of their hierarchical levels, namely, L = (LCA_4, LCA_3, LCA_2, LCA_1). The first LCA in the sorted list, LCA_4, is grouped with all its descendants in the hierarchy. The resulting cluster contains G_15, G_20, G_21, G_26, G_27, G_28, G_33, G_34, G_35, G_36, G_42, G_43, G_44, and G_45. Similarly, by grouping all the descendants of G_11 (i.e. LCA_3), we represent the second cluster of GO terms by a hierarchical subgraph rooted at G_11. This subgraph contains 11 GO terms, including G_11 itself. Continuing to the next LCA in the list, LCA_2, we cluster all descendants of G_4 (i.e. LCA_2) that have not been assigned to an earlier cluster. Excluding the terms included in the second cluster, we form the third cluster of GO terms, consisting of G_4, G_7, G_8, G_12, G_18, G_24, G_25, G_32, G_40 and G_41. Finally, based on LCA_1, we group G_1, G_2, G_3, G_5, G_6, G_9, G_10, G_13, G_14 and G_19 into the fourth cluster.

The entire hierarchy is consequently partitioned into four subgraphs, each corresponding to an LCA, based on the provided training set of protein pairs, namely, {<P_1, P_2>, <P_3, P_4>, <P_5, P_6>, <P_7, P_8>, <P_9, P_10>}. Provided with different training protein pairs, we can partition the hierarchy accordingly to reflect the different interaction characteristics of the protein pairs.
Feature vectors of GO terms have previously been constructed by considering the presence or absence of shared GO terms [19], or by weighting the GO terms by their information content and local topology [16]. Instead, we define one GO-based feature as one GO-term cluster indexed by an LCA. To translate the sets of GO-term annotations G_i and G_j for each protein pair <P_i, P_j> into numeric values of LCA-indexed GO-based features, we first locate the GO terms in sets G_i and G_j on each LCA-indexed subgraph. For each GO term, we count the nodes along the ascending path up to the root of its subgraph; the number of distinct nodes visited on the subgraph is assigned as the value of the corresponding GO-term feature.

Figure 4 shows the encoding of two protein pairs into two feature vectors, based on the four LCA-indexed GO-term features presented in Fig. 3. To obtain the LCA-indexed GO-term feature vector for the protein pair <P_11, P_12>, we locate the GO terms of P_11 and P_12 on the hierarchy. The GO terms G_5 and G_6 are located in the subgraph of LCA_1, terms G_7 and G_8 are located in the subgraph of LCA_2, and G_20 is located in the subgraph of LCA_4. The subgraph rooted at LCA_3 contains no GO term of either P_11 or P_12. Tracing along the ascending paths from G_5 and G_6 up to LCA_1 (blue arrows on the subgraph of LCA_1 in Fig. 4), we encounter G_5, G_6, G_3, and G_1 (a total of four nodes). Therefore, the value of the LCA_1-indexed GO-term feature is 4. Similarly, the values of the GO-term features indexed by LCA_2 and LCA_4 are determined as 3 and 2, respectively. As the subgraph of LCA_3 contains no GO terms of either P_11 or P_12, the GO-term feature indexed by LCA_3 is assigned a value of zero. Finally, the LCA-indexed GO-term feature vector for <P_11, P_12> is obtained as (2, 0, 3, 4), with the elements ordered as in the sorted list L = (LCA_4, LCA_3, LCA_2, LCA_1). The GO terms of <P_13, P_14> are converted into the GO-term feature vector (0, 3, 3, 0) by the same process (see Fig. 4).

Fig. 4 Example of encoding protein pairs into LCA-indexed GO-term feature vectors. The blue and green arrows show the ascending traversals up to the LCAs from the GO terms of <P_11, P_12> and <P_13, P_14>, respectively.

Because the partitioning of the GO DAG depends on the given training data, the GO-based features of the same protein pair can vary in number and in value, adapting dynamically to changes in the training data. This flexibility warrants a better definition of GO-based features and leads to higher predictive performance as the size and quality of the training data increase.
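The clustering and encoding steps can be sketched as follows. This is an illustrative reading of the procedure, not the authors' code: the small DAG below is hypothetical (it is not the hierarchy of Fig. 3), the LCAs are supplied directly rather than computed, depth is assumed to be the level below the root, and a feature value is taken as the number of distinct nodes on the upward paths that stay inside a cluster, which matches the count of four nodes in the worked example above.

```python
def descendants(children, root):
    """Return root together with every node reachable from it."""
    seen, stack = set(), [root]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(children.get(node, []))
    return seen

def partition_by_lcas(children, depth, lcas):
    """Cluster the DAG around the LCAs, deepest LCA first; each term joins one cluster."""
    ordered = sorted(dict.fromkeys(lcas), key=lambda t: (-depth[t], t))
    assigned, clusters = set(), {}
    for lca in ordered:
        clusters[lca] = descendants(children, lca) - assigned
        assigned |= clusters[lca]
    return clusters

def encode_pair(terms, clusters, parents):
    """One feature per cluster: distinct nodes on the upward paths inside that cluster."""
    vector = []
    for members in clusters.values():
        visited, stack = set(), [t for t in terms if t in members]
        while stack:
            node = stack.pop()
            if node not in visited:
                visited.add(node)
                stack.extend(p for p in parents.get(node, []) if p in members)
        vector.append(len(visited))
    return vector

# Hypothetical toy hierarchy: parent -> children, plus the level of each term.
children = {"G1": ["G2", "G3"], "G2": ["G4", "G5"], "G3": ["G5", "G6"],
            "G4": ["G7"], "G5": ["G7", "G8"]}
depth = {"G1": 1, "G2": 2, "G3": 2, "G4": 3, "G5": 3, "G6": 3, "G7": 4, "G8": 4}
parents = {}
for parent, kids in children.items():
    for kid in kids:
        parents.setdefault(kid, []).append(parent)

clusters = partition_by_lcas(children, depth, ["G1", "G5", "G2", "G5"])
features = encode_pair({"G7", "G4", "G6"}, clusters, parents)
print(list(clusters), features)
```

Supplying a different list of LCAs changes both the number of clusters and the feature values of the same pair, which is the training-data dependence described above.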

Feature extraction: network-based features
We derive the network-based features from the topological properties of a PPI network, N_PPI = <V, E>, where V and E denote the node and link sets, respectively. Here, each node represents a protein, and each link is an interaction between two proteins. To predict the PPIs of a set of proteins, we construct the PPI network N_PPI, in which two proteins are linked depending on the semantic similarity of their GO terms. The functional similarity between two gene products can be determined by various similarity measures, some of which were originally developed for natural language taxonomy [37,40,41]. We here measure the functional similarity between proteins by the widely used Resnik's measure [40], which has proven superior in several prominent studies [12,39,42].

Resnik's measure quantifies the semantic similarity between two ontology terms t_i and t_j as the information content (IC) of their most informative common ancestor (MICA) [11,13,40]. Resnik's semantic similarity between t_i and t_j is defined by Eq. (5):

$$sim_{Resnik}(t_i, t_j) = \max_{t \in CA(t_i, t_j)} IC(t) \qquad (5)$$

where CA(t_i, t_j) is the set of common ancestors of t_i and t_j in the GO hierarchy, and IC(t) is the information content of term t. IC(t) is defined by −log p(t), where p(t) is the occurrence probability of term t in a specific GO annotation corpus. Accordingly, the Resnik's similarity between two proteins P_i and P_j, annotated with the sets of GO terms G_i and G_j respectively, is defined as the maximum over the set G_i × G_j:

$$sim_{Resnik}(P_i, P_j) = \max_{t_i \in G_i,\; t_j \in G_j} sim_{Resnik}(t_i, t_j)$$

After computing the Resnik's semantic similarity between any two proteins, we select one of the computed similarity values as the threshold θ_R. The network N_PPI is then constructed by linking only the proteins with a semantic similarity above θ_R. The threshold θ_R is obtained with the help of a reference PPI network, called N_S, derived from the training set of protein pairs. In constructing N_S, each protein pair is pre-classified as interacting or non-interacting, and two proteins are connected only when confirmed as interacting in the training set. The threshold θ_R is then selected to equalize the average degrees in N_PPI and N_S, thereby capturing the PPI characteristics of the training data in N_PPI. Based on the topology of N_PPI, we create five network-based features for each protein pair <P_i, P_j>: (a) the number of common neighbors, (b) the Jaccard index, (c) the Adamic-Adar index, (d) the preferential attachment score, and (e) the Otsuka-Ochiai coefficient [43,44]. The network-based features are formally defined in Table 2. Like the GO-based features, the network-based features of the same protein pair adapt to the training data, because a change in the training data changes the topology of the PPI network.
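The five topological features can be computed from neighbor sets, as in the sketch below. Table 2 is not reproduced in this text, so the formulas used here are the standard textbook definitions of these indices and are an assumption about what the table contains; the function names and the six-link toy network are hypothetical.

```python
import math

def build_adjacency(edges):
    """Map each protein to the set of its neighbors in an undirected network."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def network_features(adj, u, v):
    """Five neighborhood-based scores for the candidate pair (u, v), with u != v."""
    nu, nv = adj.get(u, set()), adj.get(v, set())
    common, union = nu & nv, nu | nv
    return {
        "common_neighbors": len(common),
        "jaccard": len(common) / len(union) if union else 0.0,
        # A common neighbor of two distinct nodes has degree >= 2, so log(degree) > 0.
        "adamic_adar": sum(1.0 / math.log(len(adj[z])) for z in common),
        "preferential_attachment": len(nu) * len(nv),
        "otsuka_ochiai": len(common) / math.sqrt(len(nu) * len(nv)) if nu and nv else 0.0,
    }

# Toy network: links would come from thresholding the semantic similarity.
toy_edges = [("A", "C"), ("A", "D"), ("B", "C"), ("B", "D"), ("B", "E"), ("C", "D")]
adj = build_adjacency(toy_edges)
feats = network_features(adj, "A", "B")
print(feats)
```

In this toy network, A and B share the neighbors C and D, so the pair has two common neighbors out of three distinct neighbors in total.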

Stacked generalization
Ensemble learning combines many different classifiers into one predictive unit typically by majority voting. In simple voting schemes such as bagging [45], each classifier is allowed one vote, and the majority vote is accepted as the final prediction. Boosting [46] is a more complex scheme that weights the training examples by the difficulty of classifying them correctly, and updates the rewards to the classifiers based on the weights of their correctly classified examples. The final predictive unit is the weighted average of all classifiers over their rewards.
Unlike the bagging and boosting approaches, which mainly aim to improve the performance of a classifier by reducing the variance of multiple classifiers, our stacked classifiers operate as layered processes that aim to deduce the biases of the base generalizers [24]. In the stacked learning framework, each base classifier in a set is trained on a dataset, and their predictions are assembled as the meta-data. Successive layers of meta-classifiers receive the meta-data as the input for training the meta-models in parallel, then pass their outputs to the subsequent layer. A single classifier at the top level makes the final prediction. Stacked generalization is considered a form of meta-learning because the transformed training data for the current layer contain the predictive information of the preceding learners, which constitutes a form of meta-knowledge. We developed a two-level stacked generalization architecture for PPI prediction. The bottom level comprises four base classifiers: RF [47], NB [48], ANN [49] and KNN [50]. At the top level, we place a Radial Basis Function (RBF) kernel SVM [51] as a meta-classifier that arbitrates among the base classifiers and makes the final prediction. The base classifiers are trained on a set of protein pairs that have been pre-labeled as interacting or non-interacting, and translated into vectors of sequence-based, GO-based, and network-based features. The predictions of the base classifiers provide the meta-data for training the top-level SVM. To classify a new protein pair, we first feed its feature vector, derived from the physicochemical properties, GO terms, and network topologies, to each trained base classifier, which makes a prediction. Subsequently, the predictions of the four classifiers are input to the trained SVM, which makes the final PPI prediction for the new protein pair.
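The two-level flow (base predictions become the inputs of a meta-classifier) can be sketched with the standard library alone. This is a simplified illustration, not PPI-MetaGO: a nearest-centroid scorer and a k-nearest-neighbor scorer stand in for the four base classifiers, a logistic regression stands in for the RBF-kernel SVM, and building the meta-data from out-of-fold predictions is an assumption in the spirit of [24] rather than a detail stated above. All class and function names are hypothetical.

```python
import math
import random

class NearestCentroid:
    """Base learner: score from the relative distance to the two class centroids."""
    def fit(self, X, y):
        self.centroids = []
        for label in (0, 1):
            rows = [x for x, t in zip(X, y) if t == label]
            self.centroids.append([sum(col) / len(rows) for col in zip(*rows)])
        return self

    def predict_proba(self, x):
        d0, d1 = (math.dist(x, c) for c in self.centroids)
        return d0 / (d0 + d1) if d0 + d1 > 0 else 0.5

class KNearest:
    """Base learner: fraction of positives among the k nearest training samples."""
    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        self.X, self.y = list(X), list(y)
        return self

    def predict_proba(self, x):
        ranked = sorted(zip(self.X, self.y), key=lambda row: math.dist(x, row[0]))
        nearest = ranked[: self.k]
        return sum(t for _, t in nearest) / len(nearest)

class LogisticMeta:
    """Meta-learner: logistic regression fitted by stochastic gradient descent."""
    def __init__(self, lr=0.1, epochs=200):
        self.lr, self.epochs = lr, epochs

    def fit(self, X, y):
        self.w = [0.0] * (len(X[0]) + 1)          # bias, then one weight per base learner
        for _ in range(self.epochs):
            for x, t in zip(X, y):
                err = self.predict_proba(x) - t
                self.w[0] -= self.lr * err
                for i, xi in enumerate(x, start=1):
                    self.w[i] -= self.lr * err * xi
        return self

    def predict_proba(self, x):
        z = self.w[0] + sum(w * xi for w, xi in zip(self.w[1:], x))
        return 1.0 / (1.0 + math.exp(-max(-30.0, min(30.0, z))))

def fit_stack(base_factories, X, y, n_folds=5):
    """Train base learners, collect out-of-fold predictions, then train the meta-learner."""
    meta_X = [[0.0] * len(base_factories) for _ in X]
    for fold in range(n_folds):
        train = [i for i in range(len(X)) if i % n_folds != fold]
        for j, make in enumerate(base_factories):
            model = make().fit([X[i] for i in train], [y[i] for i in train])
            for i in range(fold, len(X), n_folds):
                meta_X[i][j] = model.predict_proba(X[i])
    meta = LogisticMeta().fit(meta_X, y)
    bases = [make().fit(X, y) for make in base_factories]   # refit on all data
    return bases, meta

def predict_stack(bases, meta, x):
    """Feed the base predictions for x to the meta-learner."""
    return meta.predict_proba([b.predict_proba(x) for b in bases])

# Toy data: two separated clusters standing in for protein-pair feature vectors.
rng = random.Random(0)
y = [i % 2 for i in range(60)]
X = [[rng.gauss(2.0 * t, 0.7), rng.gauss(2.0 * t, 0.7)] for t in y]
bases, meta = fit_stack([NearestCentroid, KNearest], X, y)
p_negative_region = predict_stack(bases, meta, [0.0, 0.0])
p_positive_region = predict_stack(bases, meta, [2.0, 2.0])
print(round(p_negative_region, 3), round(p_positive_region, 3))
```

The design point the sketch tries to show is that the meta-learner never sees the original features, only the base learners' outputs, so it can learn how much to trust each base learner.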

Datasets
Our PPI-MetaGO for PPI prediction was evaluated on the datasets of eight species: Homo sapiens, Mus musculus, Drosophila melanogaster, Arabidopsis thaliana, Caenorhabditis elegans, Saccharomyces cerevisiae, Schizosaccharomyces pombe, and Escherichia coli. In the comparative analysis, we used the data collected from different databases and processed in earlier studies, namely, DIP [52], HPRD [53] and MIPS MPact [54]. The species, sizes and prediction tools of the datasets are summarized in Table 3. For species studied by different prediction methods on different datasets, such as H. sapiens, Table 4 summarizes the numbers of coincident proteins and protein pairs in the additional datasets. These numbers indicate the degrees of similarity between pairs of datasets, and should consequently be considered when evaluating and comparing the prediction methods.

Performance measures
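To evaluate and compare the performances of PPI-MetaGO and other PPI prediction approaches, we conducted 10-fold cross-validation (CV) using seven measures: (1) true positive rate (TPR), (2) false positive rate (FPR), (3) precision, (4) percentage accuracy (ACC), (5) F-score, (6) Matthews correlation coefficient (MCC), and (7) area under the receiver operating characteristic curve (AUC). The performance measures are defined as follows:

$$TPR = \frac{TP}{TP + FN}, \qquad FPR = \frac{FP}{FP + TN}, \qquad Precision = \frac{TP}{TP + FP}$$

$$ACC = \frac{TP + TN}{TP + TN + FP + FN}, \qquad F\text{-}score = \frac{2 \times Precision \times TPR}{Precision + TPR}$$

$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

where TP, TN, FP, and FN represent the numbers of true positives, true negatives, false positives, and false negatives, respectively. The AUC is the area under the ROC curve, which plots the TPR against the FPR as the decision threshold varies.

Table footnote: Numbers in parentheses are the total numbers of non-duplicated protein pairs in the two datasets, e.g. HS1 and HS2.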
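The measures can be computed directly from the four confusion counts, as in the following sketch. It uses the standard definitions; the function names are hypothetical, the AUC is computed by the rank-based (pairwise comparison) formulation, and returning an MCC of zero when the denominator vanishes is a common convention assumed here.

```python
import math

def confusion_metrics(tp, tn, fp, fn):
    """Threshold-dependent measures from the four confusion counts."""
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f_score = 2 * precision * tpr / (precision + tpr)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"tpr": tpr, "fpr": fpr, "precision": precision,
            "accuracy": accuracy, "f_score": f_score, "mcc": mcc}

def auc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, t in zip(scores, labels) if t == 1]
    neg = [s for s, t in zip(scores, labels) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

metrics = confusion_metrics(tp=40, tn=45, fp=5, fn=10)
area = auc([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0])
print(metrics, area)
```

Unlike the other six measures, the AUC is computed from the raw scores and does not depend on a single decision threshold.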

Performance comparison of PPI-MetaGO and recent PPI predictors
The PPIs predicted by PPI-MetaGO on the different datasets were compared with those of seven recent PPI predictors: PRED_PPI [55], SPRINT [6], TRI_tool [56], hierarchical vector space model (HVSM) [57], go2ppi [19], GIS-MaxEnt [17], and DeepSequencePPI [22]. Among these, PRED_PPI, SPRINT, TRI_tool, and DeepSequencePPI are sequence-based methods, whereas HVSM, go2ppi and GIS-MaxEnt are GO-driven approaches. Each of these prediction tools was previously trained and tested on a different dataset. In each experiment, we selected one tool for comparison with our proposed approach. To ensure a consistent and unbiased test, we trained and tested PPI-MetaGO exclusively on the training and evaluation datasets of the predictor selected for comparison. The performances of the different PPI prediction methods were evaluated by three repetitions of stratified 10-fold CV. The dataset was randomly divided into 10 disjoint folds (subsets) of approximately equal size. The folds were stratified to maintain the same distribution of interacting and non-interacting protein pairs as in the original dataset. One fold was retained for testing the prediction performance; the remaining nine folds were used for training. The same training-testing process was iterated over every fold. In each iteration, if the performance of the PPI predictors was sensitive to the parameter values, we optimized all settings in a systematic search (sequential or grid search) within a range of parameter values, and used the values yielding the best prediction. The results of the test runs on each fold were pooled. After completing all iterations of the 10-fold CV, the results of all runs were averaged to obtain the overall performance of the predictor. The results are shown in Table 5. The ACC, F-score, MCC and AUC performances of PPI-MetaGO and the other PPI predictors were compared in paired t-tests. Conventionally, significant differences in comparison tables are marked with an asterisk.
However, the asterisks in Table 5 indicate insignificant differences, highlighting that in most cases, PPI-MetaGO significantly outperforms the established prediction tool. Note that in Table 5 some of the values of AUC are higher than those of ACC, F-score, and MCC for the same dataset, such as in HS1 and SC2. This is because AUC is defined as the area under the ROC curve, which depicts the tradeoffs between true positives and false positives, while any of the other performance measures (e.g. ACC) merely corresponds to a single point in the ROC space, depending on the output score threshold specified for the prediction tools. To warrant the best performance of each tool for the CVs, we chose the threshold value that maximized the MCC in the training phase, and used that threshold for predicting PPI in the test phase of the CVs.
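The stratified fold construction and the MCC-maximizing threshold selection described above can be sketched as follows. This is a minimal illustration under assumptions: the function names, the round-robin dealing of shuffled indices, and the use of the observed scores as candidate thresholds are choices made for the sketch, not details reported for the experiments.

```python
import math
import random

def stratified_folds(labels, n_folds=10, seed=0):
    """Split sample indices into folds that preserve the class proportions."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for cls in sorted(set(labels)):
        idx = [i for i, t in enumerate(labels) if t == cls]
        rng.shuffle(idx)
        for k, i in enumerate(idx):               # deal each class round-robin
            folds[k % n_folds].append(i)
    return folds

def mcc(tp, tn, fp, fn):
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def best_mcc_threshold(scores, labels):
    """Pick the score cutoff (predict positive when score >= cutoff) with the highest MCC."""
    best_t, best_m = None, -2.0
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < t and y == 0)
        m = mcc(tp, tn, fp, fn)
        if m > best_m:
            best_t, best_m = t, m
    return best_t

labels = [1] * 30 + [0] * 70
folds = stratified_folds(labels, n_folds=10)
threshold = best_mcc_threshold([0.9, 0.8, 0.4, 0.3], [1, 1, 0, 0])
print([len(f) for f in folds], threshold)
```

In a cross-validation loop, the threshold would be chosen on the nine training folds and then applied unchanged to the held-out fold.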
Based on SVM, PRED_PPI [55] was developed for predicting PPIs in humans, yeast, Drosophila, E. coli, and C. elegans. As shown in Table 5, the ACC, F-score, MCC, and AUC on the HS1, EC1, DM1, CE, and SC1 datasets (on which PRED_PPI was trained and tested) were significantly higher for PPI-MetaGO than for PRED_PPI (paired t-test, p < 0.05). The superiority of PPI-MetaGO could be attributable to the inclusion of GO-based and network-based features in its protein-pair representation, and the synergy of multiple base classifiers in its learning. Unlike PRED_PPI, both SPRINT [6] and TRI_tool [56] were specifically developed for PPI predictions in humans. SPRINT was designed for predicting the entire human interactome, whereas TRI_tool is a web-based tool that automatically predicts transcriptional regulation interactions in humans. We compared PPI-MetaGO with SPRINT on the human PPI dataset HS2 (on which SPRINT was trained and tested). SPRINT applies an alignment algorithm that evaluates the contributions of similar protein subsequences to the likelihood of protein interactions. In contrast, the sequence-based features in PPI-MetaGO were derived from the physicochemical properties of amino acids. Although the ACC and MCC were significantly higher for PPI-MetaGO than for SPRINT (paired t-test, p < 0.05), the F-score and AUC were lower than those of SPRINT, probably because SPRINT is designed specifically for human PPI prediction and has been carefully trained on human PPI data. The final sequence-based PPI predictor compared with PPI-MetaGO was the web tool TRI_tool, which predicts PPIs using a pseudo amino-acid composition representation and an RF classifier. In this comparison, PPI-MetaGO and TRI_tool were tested on HS3 (on which TRI_tool was trained and evaluated). The ACC, F-score, MCC, and AUC were significantly higher for PPI-MetaGO than for TRI_tool (paired t-test, p < 0.05), although the improvement achieved by PPI-MetaGO was modest.
Instead of relying on hand-crafted features to represent a protein pair for PPI prediction in deep learning [20,21], DeepSequencePPI [22] learns low-level features directly from raw protein sequences by combining the embedding techniques with recurrent neural networks. We compared PPI-MetaGO and DeepSequencePPI on SC6, on which DeepSequencePPI had been earlier trained and tested. Compared with the other datasets collected from Saccharomyces cerevisiae, SC6 has the largest size in terms of the number of interacting and non-interacting protein pairs, respectively. The dataset size has a greater impact on deep learners than on other predictors because deep learning engages in feature extraction from raw data before constructing the prediction model. As a result, DeepSequencePPI could have more leverage with large datasets, such as SC6, than PPI-MetaGO. While the ACC, F-score, MCC, and AUC were significantly higher in DeepSequencePPI than in PPI-MetaGO (paired t-test, p < 0.05), the differences were marginal.
In addition to the sequence-based methods, we selected three state-of-the-art GO-driven approaches for comparison with PPI-MetaGO. To facilitate the paired comparisons with PPI-MetaGO, we tested each GO-driven approach on all three categories of GO terms, rather than sequentially evaluating the performance on each category. As a hybrid approach, go2ppi [19] combines semantic similarity measures (SSMs) and ML. PPI-MetaGO and go2ppi-RF (using Random Forest) were evaluated on the eight datasets previously used for training and testing go2ppi. In six of the eight datasets (all except EC2 and AT), PPI-MetaGO significantly outperformed go2ppi-RF on all four measures, ACC, F-score, MCC and AUC (paired t-test, p < 0.05). PPI-MetaGO performed significantly better than go2ppi-RF on all measures except AUC in the EC2 dataset, and the differences in ACC and MCC were insignificant in the AT dataset (as indicated by the asterisks in Table 5). Rather than adopting a hybrid approach, HVSM refines the basic vector space model (VSM) approaches by relating the terms in the hierarchical structure of the GO DAG. The method considers not only the directly annotated GO terms, but also their ancestors and descendants. HVSM improves the expressiveness of the gene vectors transformed from GO terms, which should improve the accuracy of the similarity measure between vector pairs. We compared PPI-MetaGO and HVSM on HS5 and SC4, on which HVSM had been earlier trained and tested. In an evaluation study, the similarity measure of HVSM achieved a higher AUC on HS5 and SC4 [57] than several popular SSMs, including TCSS [13] and Resnik's measure [40]. However, PPI-MetaGO, which adopts Resnik's measure in its hybrid approach, outperformed HVSM in AUC and all other performance measures (see Table 5). The third annotation-based method selected for a performance comparison with PPI-MetaGO was GIS-MaxEnt.
Unlike go2ppi and HVSM, GIS-MaxEnt incorporates two annotation sources, GO and InterPro, and processes them by a maximum entropy modeling method, thus preparing an input matrix for training the SVM in PPI prediction. We compared the performances of PPI-MetaGO and GIS-MaxEnt on the SC5 dataset, on which GIS-MaxEnt had been previously evaluated. PPI-MetaGO significantly outperformed GIS-MaxEnt for all four performance measures (ACC, F-score, MCC and AUC; paired t-test, p < 0.05).
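The paired t-tests reported above can be illustrated with a minimal sketch. It assumes that the two predictors are scored on the same cross-validation folds and that the test is applied to the per-fold differences; the pairing unit is not stated in this section, so this is an assumption. The function name, the per-fold AUCs, and the use of a tabulated critical value instead of a computed p-value are all illustrative choices rather than the authors' implementation.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(scores_a, scores_b):
    """Return (t, df) for a paired t-test on matched per-fold scores."""
    if len(scores_a) != len(scores_b):
        raise ValueError("scores must be paired fold by fold")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two paired scores")
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean(diffs) / (sd / math.sqrt(n)), n - 1


# Two-sided critical value of Student's t for alpha = 0.05 and df = 9 (10 folds).
T_CRIT_DF9 = 2.262

# Hypothetical per-fold AUCs, for illustration only (not values from the paper).
auc_a = [0.95, 0.94, 0.96, 0.95, 0.93, 0.96, 0.94, 0.95, 0.96, 0.94]
auc_b = [0.90, 0.91, 0.92, 0.90, 0.89, 0.93, 0.90, 0.91, 0.92, 0.90]

t_stat, df = paired_t_statistic(auc_a, auc_b)
significant = abs(t_stat) > T_CRIT_DF9  # True means p < 0.05 (two-sided)
```

A statistics library that reports exact p-values would normally replace the hard-coded critical value; the fixed constant only keeps the sketch free of external dependencies.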

Study of cross-species PPI predictions
In addition to intra-species self-tests, cross-species PPI prediction has been reported in several previous studies [19,58]. In these studies, the PPI predictor was trained on one species and then tested on others. According to Park's [58] results, the cross-species predictive performances of sequence-based PPI predictors are considerably lower than their intra-species self-test performances. An AUC of 0.9 achieved by 4-fold CV on a human dataset can decrease to 0.68 if the predictor is trained on yeast data before being applied to the human dataset. In contrast to sequence-based prediction methods, Maetschke et al. [19] hypothesized that GO-based predictors can maintain good cross-species predictive performances because GO was designed as a species-independent annotation system. They separately tested go2ppi with an NB classifier on seven species in the BP, CC, and MF ontologies, and concluded that good performance in cross-species prediction requires a high intra-species self-test performance. That is, the predictive performance on the target species was high when the self-test performance for that species was also high; otherwise, the cross-species performance was low. Following Maetschke et al. [19], we conducted cross-species 10-fold CV tests of PPI-MetaGO and go2ppi-NB on the same datasets of the same seven species, using the BP, CC, and MF ontologies separately. The AUCs are summarized in Table 6. For reference, the intra-species self-test results are shown in boldface in the diagonal cells.
From Table 6, we note that PPI-MetaGO and go2ppi-NB achieved (almost) maximum AUCs on all self-tests, and that these AUCs were usually higher than those obtained from cross-species tests. Compared with PPI-MetaGO, go2ppi-NB produced substantially lower self-test and cross-species AUCs in most cases. Consistent with previous studies, the performance on the target species was high (low) when the self-test performance on that species was also high (low). In both Maetschke et al.'s study and ours, the AUCs of the self-tests and cross-species tests were lowest on the mouse PPI dataset (the MM dataset; see final column of Table 6). Notably, however, when PPI-MetaGO was trained on MM, it achieved reasonably high AUCs when tested on the datasets of other species.
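The cross-species protocol (train on one species, test on each of the others, and record the AUC) can be sketched as follows. This is a hedged illustration rather than the evaluation code used in the study: the species labels "A" and "B", the one-dimensional toy features, and the `fit` and `score` callables are hypothetical placeholders, and the AUC is computed with a simple pairwise definition.

```python
def auc(labels, scores):
    """Pairwise AUC: chance that a positive outscores a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need both positive and negative examples")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def cross_species_auc(datasets, fit, score):
    """Train on each species and test on every other species.

    datasets: {species: (features, labels)}
    fit(features, labels) -> model; score(model, features) -> list of scores.
    Returns {(train_species, test_species): AUC}.
    """
    table = {}
    for train_name, (x_train, y_train) in datasets.items():
        model = fit(x_train, y_train)
        for test_name, (x_test, y_test) in datasets.items():
            if test_name == train_name:
                continue  # self-tests need cross-validation instead
            table[(train_name, test_name)] = auc(y_test, score(model, x_test))
    return table


# Toy stand-ins (hypothetical): one numeric feature per protein pair.
def fit(xs, ys):
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    return 1.0 if sum(pos) / len(pos) >= sum(neg) / len(neg) else -1.0


def score(model, xs):
    return [model * x for x in xs]


datasets = {
    "A": ([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]),
    "B": ([0.7, 0.4, 0.6, 0.3], [1, 1, 0, 0]),
}
table = cross_species_auc(datasets, fit, score)
```

The diagonal is skipped deliberately: scoring a model on its own training data would overstate the self-test performance, which is why self-tests are estimated by cross-validation.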

Discussion
We introduced an ensemble meta-learning approach, PPI-MetaGO, for PPI prediction that integrated different protein-pair representations. To demonstrate its performance, we compared PPI-MetaGO with seven other PPI prediction tools on 19 protein datasets from eight species. Based on the design of PPI-MetaGO and the results of the experiments, we identified three issues worth further discussion. First, while Table 5 shows the superiority of PPI-MetaGO using a combination of three types of features, could other feature combinations produce the same level of synergy, and to what degree did they affect the prediction performance? Second, different benchmark datasets, even when collected from the same species (e.g., HS1 to HS5), have been used in previous studies of PPI prediction (see Table 3). How significant was data selection for evaluating the performance? Third, the GO-based and network-based features employed in PPI-MetaGO are obtained from the partitioning of a GO term hierarchy and the topological properties of a PPI network, respectively. As the training data determine both the GO hierarchy clustering and the PPI network, the GO-based and network-based features can both vary when different training data are provided. How did they adapt to changes in the benchmark datasets for the same species, such as H. sapiens? We discuss these issues as follows.

Synergy of different feature combinations
PPI-MetaGO constructs a meta-classification model for PPI prediction using three types of features: physicochemical features, LCA-indexed GO-term features, and network-based features. For simplicity, we denote the feature types by F1, F2, and F3, respectively. The effects of combining F1, F2, and F3 were summarized in Table 5, but a comparison with other feature combinations can provide insight into the importance of the different feature types in PPI prediction. For this purpose, we tested all possible feature combinations in PPI-MetaGO on the same datasets, and analyzed their effects on the prediction performance. The results of the different feature combinations on some PPI datasets are given in Table 7. As Table 7 shows, combining the F1, F2, and F3 features gave superior PPI prediction on most datasets, but some other combinations, or even single feature types, maximized the performance on certain datasets. On the HS1, HS3, and SC2 datasets, the performance of PPI-MetaGO was generally higher for the three combined feature types than for the other feature configurations. However, on the DM1 dataset, the highest ACC, F-score, and MCC were obtained by PPI-MetaGO with the F1 features alone. Meanwhile, the best performance on EC2 was obtained by PPI-MetaGO with the F1 and F2 features. The performance discrepancies among the feature combinations suggest that each feature type makes a distinct contribution to the PPI prediction on different datasets.
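Testing "all possible feature combinations" of three feature types means evaluating the seven non-empty subsets of {F1, F2, F3}. The sketch below enumerates them and picks the best-scoring subset from a results table; the helper names and the scores passed to `best_subset` are hypothetical and serve only to illustrate the bookkeeping.

```python
from itertools import combinations

FEATURE_TYPES = ("F1", "F2", "F3")


def feature_subsets(types=FEATURE_TYPES):
    """Return every non-empty subset of the feature types, smallest first."""
    return [combo
            for size in range(1, len(types) + 1)
            for combo in combinations(types, size)]


def best_subset(results):
    """Pick the subset with the highest score from {subset: score}."""
    return max(results, key=results.get)


subsets = feature_subsets()
# subsets covers 3 single types, 3 pairs, and the full triple (7 in total).

# Hypothetical scores for two configurations on one dataset.
winner = best_subset({("F1",): 0.80, ("F1", "F2"): 0.90})
```

In a full experiment, each subset would select the corresponding feature columns before the stacked model is trained and evaluated on the same folds, so that the configurations remain directly comparable.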

Effects of training data on prediction performances
PPI-MetaGO generally outperformed its competing tools, as shown in Table 5, but we also observed that its performance could vary among different datasets, even those from the same species. For example, among the H. sapiens datasets, it performed best on HS1, with an AUC as high as 0.993, but did poorly on HS2, with a markedly lower AUC of 0.791. According to Table 4, the contents of datasets from the same species (HS1 to HS5, for example) can differ significantly, as indicated by the small numbers, relative to the dataset sizes, of proteins and protein pairs shared between any pair of datasets. In addition, two datasets, such as HS2 and HS3, share very few non-interacting protein pairs (negative examples). The non-interacting protein pairs in HS4 and HS5 are entirely different, as shown in Table 4(b). These differences in the datasets can affect the training of any predictor, and consequently its predictive performance, as the AUC values above show. However, the proposed F2 features of a protein pair are able to adapt to changes in the training data because the partitioning of the GO DAG depends on the given training set of protein pairs (see Methods). Table 9 shows the numbers of F2 features derived from the different GO categories in each run of a 10-fold CV for HS1. As indicated in Table 9, the numbers of generated F2 features varied with the training data, and consequently their values for a protein pair were also adjusted to accommodate the change.
In addition, Table 10 shows the average numbers of F2 features generated for HS1 to HS5 of H. sapiens. The high variance in the numbers of F2 features generated for different datasets from the same species suggests the high adaptability of the F2 features. By contrast, the values of other GO-based features, once determined for a protein pair, remain the same regardless of the dataset. This flexibility enables the F2 features to adapt to new training data, when available, to improve predictive performance. Like the F2 features, the proposed network-based F3 features can also accommodate changes in the training data. The F3 features are based on the topology of a PPI network constructed from the training data. A change in the training data may consequently alter the topology of the PPI network and affect the F3 features. Unlike F2, F3 adapts without changing the number of features; only the feature values are revised to accommodate the change in the training data. Evaluating every change in the F3 feature values caused by changes in the training data was computationally prohibitive in our experiments. Nevertheless, the combination of F3 with F1, F2, or both generally produced higher predictive performances than F1 or F2 alone, as shown in Table 7. These findings support the contribution of F3.
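The way a network-based feature keeps its dimensionality but changes its value with the training data can be shown with a small sketch. The Jaccard overlap of neighbor sets used here is a stand-in topological measure chosen for illustration; it is an assumption and not necessarily one of the F3 features defined in Methods. The protein identifiers and function names are likewise hypothetical.

```python
from collections import defaultdict


def build_network(interacting_pairs):
    """Build an undirected adjacency map from the positive training pairs."""
    neighbors = defaultdict(set)
    for a, b in interacting_pairs:
        neighbors[a].add(b)
        neighbors[b].add(a)
    return neighbors


def jaccard_feature(neighbors, a, b):
    """Jaccard overlap of two proteins' neighbor sets (0.0 if both are isolated)."""
    na = neighbors.get(a, set())
    nb = neighbors.get(b, set())
    union = na | nb
    return len(na & nb) / len(union) if union else 0.0


# Two hypothetical training sets that differ by a single interacting pair.
train_1 = [("P1", "P3"), ("P2", "P3"), ("P1", "P4")]
train_2 = train_1 + [("P2", "P4")]

f_before = jaccard_feature(build_network(train_1), "P1", "P2")
f_after = jaccard_feature(build_network(train_2), "P1", "P2")
# The pair (P1, P2) is still described by one feature, but its value differs.
```

Because the network is rebuilt from each training fold, the feature for a test pair is computed without using any test interactions, which keeps the evaluation free of label leakage under this sketch's assumptions.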

Conclusions
Researchers have proposed various computational methods for predicting PPIs. These methods are characterized by two primary aspects: (a) the computational strategy that classifies the protein interactions, such as semantic similarity comparisons versus supervised ML approaches, and (b) the representation describing the protein pairs, such as amino acid properties versus GO annotations. These differences in the design philosophies affect the prediction performances of the methods. This study presented an ensemble meta-learning approach for PPI prediction, which utilizes the synergy of multiple ML algorithms and different protein-pair representations to improve the PPI prediction.
The performance of our proposed method, called PPI-MetaGO, was extensively compared with those of seven competitive PPI predictors on 19 protein datasets covering eight species. The experimental results demonstrated the favorable performance of PPI-MetaGO over other PPI predictors. The AUC of PPI-MetaGO exceeded 0.9 on 14 of the 19 datasets, reaching 0.95 or higher on 11 datasets. Following previous works, we also ran cross-species PPI prediction tests. Again, the AUCs of PPI-MetaGO were generally high, exceeding those of the competitive predictors in most of the cross-species PPI prediction tests. Overall, these results verify the feasibility and superiority of the proposed ensemble meta-learning approach in PPI prediction. Moreover, as a wider variety of ML algorithms becomes available for base learning, more ontologies emerge for improving the annotations of biological entities or experimental assays, and the flexibility increases for building a stacked architecture appropriate to a certain prediction task, the proposed ensemble meta-learning strategy should become extensible to other domains.