Predicting protein-protein interactions using high-quality non-interacting pairs

Background: Identifying protein-protein interactions (PPIs) is of paramount importance for understanding cellular processes. Machine learning-based approaches have been developed to predict PPIs, but the effectiveness of these approaches is unsatisfactory. One major reason is that they randomly choose non-interacting protein pairs (negative samples) or heuristically select non-interacting pairs of low quality.
Results: To boost the effectiveness of predicting PPIs, we propose two novel approaches (NIP-SS and NIP-RW) to generate high-quality non-interacting pairs based on sequence similarity and random walk, respectively. Specifically, the known PPIs collected from public databases are used to generate the positive samples. NIP-SS then selects the top-m dissimilar protein pairs as negative examples and controls the degree distribution of selected proteins to construct the negative dataset. NIP-RW performs a random walk on the PPI network to update the adjacency matrix of the network, and then selects protein pairs not connected in the updated network as negative samples. Next, we use the auto covariance (AC) descriptor to encode the feature information of amino acid sequences. After that, we employ deep neural networks (DNNs) to predict PPIs based on the extracted features and the positive and negative examples. Extensive experiments show that NIP-SS and NIP-RW can generate negative samples of higher quality than existing strategies and thus enable more accurate prediction.
Conclusions: The experimental results show that negative datasets constructed by NIP-SS and NIP-RW can reduce the bias and have good generalization ability. NIP-SS and NIP-RW can be used as a plugin to boost the effectiveness of PPI prediction. Codes and datasets are available at http://mlda.swu.edu.cn/codes.php?name=NIP.


Background
As the essential component of all organisms, proteins form the very basis of life and carry out a variety of biological functions within living organisms. A protein rarely accomplishes its functions alone; instead, it interacts with other proteins to accomplish biological functions. It is thus generally accepted that protein-protein interactions (PPIs) are responsible for most activities of living organisms. As a hotspot of proteomics research, detecting [...] Computational approaches have been developed to predict PPIs in an economic and reliable way. These approaches use different data types to predict PPIs, such as protein domains [12], protein structure information [13], gene neighborhood [14], gene fusion [15], and phylogenetic profiles [16,17]. Nevertheless, these methods can hardly be applied if prior knowledge of the proteins, such as functional domains, 3D structures, and other information, is not available [18]. With the explosive growth of sequence data, more and more researchers have moved toward sequence-based approaches to predict PPIs. Experimental results show that it is adequate to predict new PPIs using amino acid sequences alone [19][20][21][22][23][24][25][26][27].
Martin et al. [19] extracted the feature information of amino acid sequences by the extended signature descriptor and used a support vector machine (SVM) to predict PPIs. Shen et al. [20] adopted SVM as the classifier and encoded the feature information of amino acid sequences by conjoint triad (CT), in which the 20 standard amino acids are grouped into 7 categories on the basis of the dipoles and volumes of their side chains. This SVM-based approach yields a high prediction accuracy of 83.9%. However, this approach cannot sufficiently encode the feature information, since CT only takes into account the neighboring effect of amino acid sequences, whereas PPIs usually occur at non-continuous segments of amino acid sequences. Guo et al. [21] employed the auto covariance (AC) to detect the correlation among discontinuous segments and obtained an accuracy of 86.55%. You et al. [24] combined a novel multi-scale continuous and discontinuous (MCD) feature representation and SVM to predict PPIs. The MCD feature representation can adequately capture continuous and discontinuous feature information of segments within an amino acid sequence. This method yields a high accuracy of 91.36%. Different from these SVM-based approaches, Yang et al. [22] combined kNN and local descriptor (LD) to predict PPIs and obtained an accuracy of 83.73%. Du et al. [27] applied deep neural networks (DNNs) and integrated diverse feature descriptors to encode the feature information of amino acid sequences to predict PPIs. This approach obtains a high accuracy of 92.5% on predicting PPIs of Saccharomyces cerevisiae. Wang et al. [28] used DNNs and a novel feature descriptor named local conjoint triad descriptor (LCTD), which encodes continuous and discontinuous feature information of local segments within an amino acid sequence, to predict PPIs. This approach yields a high accuracy of 93.12% on PPIs of Saccharomyces cerevisiae.
However, the performance of all the aforementioned sequence-based methods heavily depends on the quality of PPI datasets. Positive examples (interacting protein pairs) are generally chosen from interactions identified by reliable methods (small-scale experiments), interactions confirmed by Y2H [6,7], Co-IP [4], and other methods, or interactions confirmed by interacting paralogs [29,30]. Therefore, given the public protein-protein interaction databases [31], the positive examples are readily available and can be easily constructed. The difficulty is that there is no 'gold standard' of non-interacting protein pairs (negative examples), which are needed to train a discriminative PPI predictor. Two kinds of strategies are widely used by previous computational methods [19][20][21][23][24][25][26][27]. The first one randomly pairs proteins and then removes the pairs included in the positive examples [21,30]. The second constructs negative examples based on the subcellular localization of proteins [23,25,26,27]. However, these two strategies have limitations and may compromise the prediction performance. The first strategy wrongly takes a large number of positive samples as negative samples, while the second strategy leads to a biased estimation of PPI prediction [30].
In this paper, two novel approaches (NIP-SS and NIP-RW) are proposed to improve the performance of PPI prediction. NIP-SS and NIP-RW generate reliable non-interacting pairs (negative dataset) based on sequence similarity and on random walk in the PPI network, respectively. The basic idea of NIP-SS is: given a positive protein pair (i and j) and a protein k, the larger the sequence difference between i and k, the smaller the probability that k interacts with j (i). In addition, we control the degree distribution of selected protein pairs to make it similar to that of the positive dataset. For NIP-RW, let G = (V, E) be a PPI network, where V is the set of proteins and E is the set of weighted undirected edges; the weight reflects the interaction strength between a protein pair, with 1 denoting an interaction and 0 denoting unknown. The basic idea of NIP-RW is: after a k-step random walk on G, if the edge weight between two proteins is larger than 0, there may be an interaction between them; otherwise, there may be no interaction.
To investigate the effectiveness of NIP-SS and NIP-RW, we firstly collected the positive sets from the Database of Interacting Proteins (DIP) [31], and separately constructed negative sets using four strategies: 1) NIP-SS, 2) NIP-RW, 3) subcellular localization, and 4) random pairing. We then merged the positive set and each negative set to form a training dataset. Next, we used the auto covariance (AC) [21] descriptor to extract the features from amino acid sequences and deep neural networks (DNNs) to predict PPIs. AC can account for the interactions between residues a certain distance apart in the sequence and encode the features by a lower-dimensional vector [21], while DNNs can automatically extract high-level abstractions and reduce the model training time [32]. We performed comparative and quantitative experiments on public benchmark datasets to study the effectiveness of negative datasets generated by different strategies. The experimental results show that NIP-SS and NIP-RW have good generalization ability and contribute to a higher accuracy in predicting PPIs than other related and widely-used strategies.

PPIs datasets
To comprehensively evaluate the rationality of NIP-SS and NIP-RW, we constructed 3 non-redundant positive PPI sets for S. cerevisiae, H. sapiens, and M. musculus from DIP [31]. Next, we separately generated negative PPIs (non-interacting protein pairs) for these three species using NIP-SS, NIP-RW, subcellular location, and random pairing. After that, we merged the positive and negative sets for each species. As a result, twelve PPI datasets were obtained. In addition, another six datasets were collected as independent test datasets to further assess the generalization ability of NIP-SS and NIP-RW. Among them, the Mammalian dataset collected from Negatome 2.0 [33] contains only non-interacting protein pairs, which were generated by manual curation of the literature. The procedure of constructing the negative dataset will be introduced later.
The twelve datasets are divided into three groups based on the species. The experimentally validated PPIs of these three groups are all from DIP [31]. The first group contains 17257 positive PPIs of S. cerevisiae (version 20160731) collected by Du et al. [27]. The second and third groups were processed by ourselves; they contain 3355 and 923 positive PPIs of H. sapiens and M. musculus (version 20170205), respectively. These positive PPIs are generated by excluding proteins with fewer than 50 amino acids and with ≥ 40% sequence identity by cluster analysis via the CD-HIT program [34]. The excluded proteins have a heavy impact on the performance of PPI prediction [21]. Each of these three groups contains four training sets, and the difference between these four sets is the negative samples, which are generated by NIP-SS, NIP-RW, subcellular location, and random pairing. Table 1 gives the statistics of these 18 datasets.

Generating non-interacting protein pairs
Negative samples must be chosen with caution, since they can heavily affect the performance of PPI prediction. There are two primary strategies to construct negative samples: random pairing and subcellular location. For the first strategy, after constructing the positive set from DIP, we collect the proteins in the positive set into a set P. Next, we randomly select two proteins from P and take them as a non-interacting pair if they do not have an interaction in the positive set. Obviously, this random pairing is not completely reliable: it produces a high rate of false negatives among the generated negative examples, since the interactions between proteins in DIP are far from complete. The second strategy is based on the hypothesis that proteins located in different subcellular localizations do not interact. Proteins can be divided into seven groups based on subcellular location information extracted from Swiss-Prot (http://www.expasy.org/sprot/): cytoplasm, nucleus, mitochondrion, endoplasmic reticulum, Golgi apparatus, peroxisome and vacuole. The negative samples are obtained by pairing a protein from one group with a protein from another group, and protein pairs that appear in the positive set are then excluded. However, Ben-Hur and Noble [30] showed that subcellular localization based approaches lead to a biased accuracy of PPI prediction.
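The random pairing strategy can be illustrated with a minimal Python sketch. The function name `random_negative_pairs` and the toy protein identifiers are hypothetical and chosen only for illustration; the sketch enumerates all candidate pairs, which is practical only for small protein sets.

```python
import random
from itertools import combinations

def random_negative_pairs(positive_pairs, n_negatives, seed=0):
    """Sample protein pairs that are absent from the positive set."""
    # Store each interaction as an unordered pair so (a, b) equals (b, a).
    positives = {frozenset(pair) for pair in positive_pairs}
    # P: every protein that appears in at least one known interaction.
    proteins = sorted({p for pair in positive_pairs for p in pair})
    # Candidate negatives: all pairs over P without a recorded interaction.
    candidates = [pair for pair in combinations(proteins, 2)
                  if frozenset(pair) not in positives]
    rng = random.Random(seed)
    return rng.sample(candidates, min(n_negatives, len(candidates)))

# Toy positive set (hypothetical identifiers).
positive = [("A", "B"), ("B", "C"), ("C", "D"), ("A", "D")]
negatives = random_negative_pairs(positive, n_negatives=2)
```

Because a pair is accepted whenever it is merely missing from the positive set, any true interaction that the database has not yet recorded can end up labeled as negative, which is the false-negative problem described above.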
Motivated by the limitations of existing solutions, we propose two novel approaches, NIP-SS and NIP-RW, to construct the negative datasets. Let G = (V, E) encode a PPI network, where V is the set of proteins and E stores the known interactions. To construct a reliable negative dataset with good generalization ability, we want the negative dataset to cover as many proteins as possible. The average repeatability (r) can be employed to describe the generalization ability of a dataset; it is calculated from the degrees of the proteins in the dataset. Note that if a protein in the negative dataset does not 'interact' with five proteins, then this protein has a degree of five. The smaller the value of r, the better the generalization ability of the dataset. On the one hand, the degrees of proteins in the negative dataset should not be too small, because proteins with low degrees contain little predictive information and are not conducive to predicting PPIs. On the other hand, the degrees of proteins should not be too large, since this leads to an overestimation of prediction results. In addition, the maximum degree of proteins, the proportion of proteins in different ranges of degrees, and the proportion of non-interactions in each range all have an impact on the prediction performance. For these reasons, we need to construct a reliable negative dataset in which the degree distribution of proteins and the interaction distribution are similar to those in the positive dataset. Such a negative dataset contains more proteins and has less bias.
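The degree statistics mentioned above (maximum degree and the proportion of proteins per degree range) can be computed as in the following Python sketch. The function names and the bin edges are illustrative assumptions, and the average repeatability r is deliberately not computed here because its formula is not reproduced in this text.

```python
from collections import Counter

def protein_degrees(pairs):
    """Count how many pairs each protein takes part in."""
    degrees = Counter()
    for a, b in pairs:
        degrees[a] += 1
        degrees[b] += 1
    return degrees

def degree_profile(pairs, bin_edges=(10, 20, 50)):
    """Return (maximum degree, share of proteins in each degree range).

    A protein falls into the first range whose upper edge is >= its degree;
    degrees above the last edge go into the final open-ended range.
    """
    degrees = protein_degrees(pairs)
    labels = [f"<={e}" for e in bin_edges] + [f">{bin_edges[-1]}"]
    counts = dict.fromkeys(labels, 0)
    for d in degrees.values():
        for edge, label in zip(bin_edges, labels):
            if d <= edge:
                counts[label] += 1
                break
        else:
            counts[labels[-1]] += 1
    total = len(degrees)
    return max(degrees.values()), {k: v / total for k, v in counts.items()}

# Toy example: protein "A" is a hub with degree 3.
max_deg, shares = degree_profile([("A", "B"), ("A", "C"), ("A", "D")],
                                 bin_edges=(1, 2))
```

Running the same function on the positive and the negative dataset gives two profiles that can be compared range by range.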

Generating non-interacting protein pairs based on sequence similarity
The basic idea of NIP-SS is that, for an experimentally validated PPI between proteins i and j, if a protein k is dissimilar to i, there is a low possibility that k interacts with j. Based on this idea, we firstly generate the set P of proteins that have a confirmed interaction with another protein, and compute the sequence similarity between any two proteins in P. Next, we sort all protein pairs in P by sequence similarity in ascending order, and then select the top-m protein pairs with the lowest similarity as negative examples (non-interacting pairs); m is generally larger than the number of positive examples to facilitate the follow-up adjustment. If we directly use these negative examples as the negative dataset to predict PPIs, it will lead to an over-estimation of PPI prediction. This is because such a negative dataset contains some proteins with very large degrees, which occur more frequently in the negative dataset than in the positive dataset. For example, the maximum degree in the positive dataset is 252, but 1439 in the initial negative dataset (see "Contribution of controlling degrees" section). As such, bias is introduced into the training set composed of positive and negative samples. To ensure a good generalization ability, the degree distribution of proteins needs to be controlled when constructing the negative dataset.
We advocate making the degree distribution of proteins in the negative dataset similar to that of the positive dataset. We firstly calculate the degree distribution of proteins, the maximum degree, the proportion of proteins and the number of interactions in different ranges of degrees (such as degree ≤ 10, degree in (10, 20], and so on) in the positive dataset. Similarly, we compute the same values in the negative dataset. Next, we compare these values of the positive and negative datasets, and then adjust the number of non-interacting partners of a protein by referring to the corresponding values of the positive dataset. Finally, we remove the protein pairs that appear in the positive dataset to generate the reliable negative dataset. The process of NIP-SS is shown in Fig. 1.
We collect the amino acid sequences data from the UniProt database [35]. Sequence similarity between two proteins i and j is calculated using blocks substitution matrix (BLOSUM), which is a substitution matrix used for sequence alignment of proteins [36]. BLOSUM matrices are used to score alignments between evolutionary divergent protein sequences. We adopt BLOSUM50 to compute the score between proteins, and then normalize the score as follows: where n is the total number of proteins in P, bl(i, j) is the original BLOSUM50 score of protein i and j.
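The selection step of NIP-SS can be sketched in Python as follows. This is a simplified illustration and not the authors' implementation: it replaces the range-by-range adjustment against the positive dataset with a single per-protein degree cap (`max_degree`), and it assumes the normalized similarity scores are already available in a dictionary. All names are hypothetical.

```python
def select_dissimilar_pairs(similarity, positive_pairs, m, max_degree):
    """Greedily pick up to m least-similar pairs under a per-protein degree cap.

    similarity: dict mapping frozenset({i, j}) to a normalized similarity score.
    """
    positives = {frozenset(p) for p in positive_pairs}
    degree = {}
    selected = []
    # Visit candidate pairs from least to most similar.
    for pair in sorted(similarity, key=similarity.get):
        if len(selected) == m:
            break
        if pair in positives:
            continue  # never label a known interaction as negative
        i, j = sorted(pair)
        if degree.get(i, 0) >= max_degree or degree.get(j, 0) >= max_degree:
            continue  # skip pairs that would push a protein over the cap
        selected.append((i, j))
        degree[i] = degree.get(i, 0) + 1
        degree[j] = degree.get(j, 0) + 1
    return selected

# Toy similarity scores (hypothetical values).
similarity = {
    frozenset(("A", "B")): 0.9, frozenset(("A", "C")): 0.1,
    frozenset(("A", "D")): 0.2, frozenset(("B", "C")): 0.8,
    frozenset(("B", "D")): 0.3, frozenset(("C", "D")): 0.7,
}
positive = [("A", "B"), ("C", "D")]
capped = select_dissimilar_pairs(similarity, positive, m=3, max_degree=1)
uncapped = select_dissimilar_pairs(similarity, positive, m=3, max_degree=2)
```

With the cap set to 1, protein A cannot appear in two negative pairs, so the second least-similar pair (A, D) is skipped; relaxing the cap to 2 lets it through. This mirrors, in miniature, how uncontrolled selection concentrates negatives on a few high-degree proteins.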

Generating non-interacting pairs based on random walk
NIP-RW is motivated by the observation that interacting proteins are likely to share similar functions: level-1 (k = 1) neighborhood (directly interacting) proteins are more likely to share functions than level-2 (k = 2) neighborhood proteins, whose interactions are mediated by another protein. In other words, the probability of sharing similar functions decreases as k increases [37]. Given that, two proteins that can only be connected after a k-step random walk are less likely to share functions and thus less likely to interact with each other. The flowchart of NIP-RW is shown in Fig. 1.
Let G = (V, E) be a PPI network, where V is the set of proteins and E is the set of edges. Each vertex u ∈ V stands for a unique protein, and each edge (u, v) ∈ E represents an observed interaction between protein u and protein v; $E \in \mathbb{R}^{n \times n}$ stores the available interactions between n proteins. We define a pair of proteins (u and v) as level-k neighbors if there exists a path φ = (u, · · · , v) with length k in G. The k-step random walk process can be modeled as follows: After the k-step random walk, we obtain an updated adjacency matrix $W^{(k)} \in \mathbb{R}^{n \times n}$, which reflects the inferred interaction probability (strength) between any pair of proteins. Since E is generally very sparse, $W^{(k)}$ is still a sparse matrix. As such, the selected negative examples tend to involve proteins connected with few proteins, which biases the negative examples. To generate a negative dataset with good generalization ability, we use a sub-matrix $W_{p \times p}$ of $W^{(k)}$ to control the number of proteins and the degree distribution of these selected p proteins. After that, we select two proteins with $W_{p \times p}(i, j) = 0$ and take these two proteins as a non-interacting pair. We will investigate the parameter sensitivity of p and provide a principled way to specify p in "Contribution of controlling degrees" section.
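The idea of NIP-RW can be illustrated with a small Python sketch. The update rule is an assumption made for illustration only: the sketch accumulates walks of length 1 to k (E + E^2 + ... + E^k) on the unweighted adjacency matrix, which may differ from the exact update used by the authors. The `keep` argument stands in for the p proteins of the sub-matrix; how those proteins are chosen is not specified here, so the caller supplies them. All function names are hypothetical.

```python
def matmul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][t] * b[t][j] for t in range(n)) for j in range(n)]
            for i in range(n)]

def walk_matrix(adj, k):
    """Assumed update: W = E + E^2 + ... + E^k (walks of length 1..k)."""
    n = len(adj)
    power = [row[:] for row in adj]
    total = [row[:] for row in adj]
    for _ in range(k - 1):
        power = matmul(power, adj)
        total = [[total[i][j] + power[i][j] for j in range(n)]
                 for i in range(n)]
    return total

def unreachable_pairs(adj, k, keep):
    """Pairs (i, j) among the indices in `keep` whose entry in W is still 0."""
    w = walk_matrix(adj, k)
    return [(i, j) for pos, i in enumerate(keep) for j in keep[pos + 1:]
            if w[i][j] == 0]

# Toy network: a path 0 - 1 - 2 - 3.
adj = [[0, 1, 0, 0],
       [1, 0, 1, 0],
       [0, 1, 0, 1],
       [0, 0, 1, 0]]
one_step = unreachable_pairs(adj, k=1, keep=[0, 1, 2, 3])
two_step = unreachable_pairs(adj, k=2, keep=[0, 1, 2, 3])
three_step = unreachable_pairs(adj, k=3, keep=[0, 1, 2, 3])
```

On the toy path, the set of candidate negatives shrinks as k grows and is empty at k = 3, which illustrates why larger k leaves fewer zero entries to choose from. Using a row-normalized transition matrix instead of raw counts would change the magnitudes but not which entries are zero.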

Feature vector extraction
To effectively predict PPIs based on amino acid sequences, we need to extract and represent the essential information of interacting proteins by a feature descriptor. Many feature descriptors have been utilized to predict PPIs. Among these descriptors, conjoint triad (CT) [20] only takes into account the neighboring effect of amino acid sequences. However, PPIs generally occur at discontinuous segments of amino acid sequences. Local descriptor (LD) [23], auto covariance (AC) [21], multi-scale continuous and discontinuous (MCD) [24] and local conjoint triad descriptor (LCTD) [28] can effectively address this problem and achieve better prediction. Among these four descriptors, feature vectors encoded by AC have the lowest dimensionality. To balance effectiveness and efficiency, we employ AC to encode the feature information of amino acid sequences, and then use DNNs to predict PPIs. To be self-contained, we introduce the AC feature descriptor in the following subsection.

Auto covariance (AC)
PPIs can generally be divided into four interaction modes: electrostatic, hydrophobic, hydrogen bond, and steric [38]. Seven physicochemical properties of amino acids reflect these interaction modes as far as possible: hydrophobicity [39], hydrophilicity [40], volumes of side chains of amino acids [41], polarity [42], polarizability [43], solvent-accessible surface area [44], and net charge index of side chains [45]. The original values of these seven physicochemical properties for each amino acid are shown in Table 2. Feature normalization can improve the accuracy and efficiency of mining algorithms on the data [46]. Given that, we firstly normalize the data to zero mean and unit standard deviation as follows:
$$P'_{i,j} = \frac{P_{i,j} - \bar{P}_j}{S_j}$$
where $P_{i,j}$ is the j-th physicochemical property value for the i-th amino acid, $\bar{P}_j$ is the mean of the j-th physicochemical property over the 20 amino acids, and $S_j$ is the corresponding standard deviation of the j-th physicochemical property. Then each amino acid sequence is translated into seven vectors, with each amino acid represented by the normalized values. AC is a statistical tool introduced by Wold et al. [38]; it is adopted to transform amino acid sequences into uniform matrices. AC accounts for the interactions between residues a certain lag apart along the entire sequence. To represent an amino acid sequence A with length l, the AC variables are computed as:
$$AC(lag, j) = \frac{1}{l - lag} \sum_{i=1}^{l - lag} \left(A_{i,j} - \frac{1}{l}\sum_{t=1}^{l} A_{t,j}\right) \left(A_{i+lag,j} - \frac{1}{l}\sum_{t=1}^{l} A_{t,j}\right)$$
where lag is the distance between residues, $A_{i,j}$ is the j-th physicochemical property of the i-th amino acid of A, and l is the length of the amino acid sequence A. In this way, the number of AC variables is D = lg × p, where p is the number of descriptors, which is set to 7 according to the seven properties of amino acids, and lg is the maximum lag (lag = 1, 2, ..., lg), which is set to 30 [21]. After that, each amino acid sequence is encoded by a 210-dimensional vector (30 × 7).
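A minimal Python sketch of the AC computation is given below. It assumes the commonly used form of the descriptor, in which the covariance at each lag is averaged over the l - lag residue pairs, and it takes the already normalized per-residue property values as input rather than looking them up in Table 2. The function name and the toy input are hypothetical.

```python
def auto_covariance(props, max_lag):
    """Auto covariance features for one sequence.

    props: list of per-residue property vectors (length l, each of size p),
           already normalized per property.
    Returns a flat list of max_lag * p values, ordered by lag, then property.
    """
    length = len(props)
    n_props = len(props[0])
    if max_lag >= length:
        raise ValueError("sequence must be longer than max_lag")
    # Mean of each property over the whole sequence.
    means = [sum(row[j] for row in props) / length for j in range(n_props)]
    features = []
    for lag in range(1, max_lag + 1):
        for j in range(n_props):
            acc = sum((props[i][j] - means[j]) * (props[i + lag][j] - means[j])
                      for i in range(length - lag))
            features.append(acc / (length - lag))
    return features

# Toy sequence of four residues with a single property.
toy = auto_covariance([[1.0], [2.0], [3.0], [4.0]], max_lag=2)

# A sequence of 40 residues with 7 properties and max_lag = 30 gives 210 values.
flat = auto_covariance([[0.5] * 7 for _ in range(40)], max_lag=30)
```

The length check reflects a practical consequence of the definition: a sequence must contain more residues than the maximum lag, otherwise no residue pair exists at that distance.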

Deep neural networks
Deep learning, the most active field in machine learning, attempts to learn multi-layered models of inputs. It has achieved great success in many research areas, such as speech recognition [47], signal recognition [48], computer vision [49][50][51], natural language processing [52,53] and so on. Meanwhile, it has also been widely employed in bioinformatics [54,55]. Deep learning is good not only at automatically learning high-level features from the original data, but also at discovering intricate structures in high-dimensional data [56]. Deep neural networks (DNNs) are composed of an input layer, multiple hidden layers (three or more hidden layers), and an output layer; the configuration of the adopted DNNs is shown in Fig. 2. In general, neural networks are fed data from the input layer (x), and the output values are then computed sequentially along the hidden layers by transforming the input data in a nonlinear way. Neurons of a hidden layer or output layer are connected to all neurons of the previous layer [32]. Each neuron computes a weighted sum of its inputs and applies a nonlinear activation function to calculate its output f(x) [32]. The nonlinear activation functions usually include sigmoid, hyperbolic tangent, or rectified linear unit (ReLU) [57]. ReLU and sigmoid are employed in this work.
We separately construct two DNNs using the TensorFlow platform, as illustrated in Fig. 2. Next, the feature vectors of two individual proteins extracted by AC are employed as the inputs for these two DNNs, respectively. After that, these two separate DNNs are combined in a hidden layer to predict PPIs. The Adam algorithm [58] (an adaptive learning rate method) is applied to speed up training. Meanwhile, the dropout technique is employed to avoid overfitting. The ReLU activation function [57] and cross entropy loss are employed, since they can both accelerate the model training and obtain better prediction results [59]. The batch normalization approach is also applied to reduce the dependency of training on the parameter initialization, speed up training and minimize the risk of overfitting. The following equations are used to calculate the loss: where n is the number of PPIs for batch training, m represents the individual network, h1 is the depth of the two individual networks, and h2 is the depth of the fused network. σ1 is the ReLU activation function, σ2 is the sigmoid activation function of the output layer, and ⊕ represents the concatenation operator. X is the batch training inputs, H is the output of a hidden layer, and y is the corresponding desired outputs. W is the weight matrix between the input layer and output layer, and b is the bias.
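The two-branch design can be sketched as a forward pass in plain Python. This is only an illustration of the structure described above (two ReLU branches, concatenation, a fused ReLU layer, a sigmoid output, and cross entropy loss); it omits dropout, batch normalization and Adam training, and the layer sizes and initialization are hypothetical rather than taken from the authors' configuration.

```python
import math
import random

def dense(x, weights, bias, activation):
    """One fully connected layer: activation(W x + b)."""
    return [activation(sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, bias)]

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def init_layer(n_out, n_in, rng):
    """Small random weights and zero biases (illustrative initialization)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out

def predict(x_a, x_b, branch_a, branch_b, fused, output):
    h_a = dense(x_a, *branch_a, relu)      # branch for the first protein
    h_b = dense(x_b, *branch_b, relu)      # branch for the second protein
    merged = h_a + h_b                     # list concatenation plays the role of ⊕
    h = dense(merged, *fused, relu)        # fused hidden layer
    return dense(h, *output, sigmoid)[0]   # predicted interaction probability

def cross_entropy(y_true, y_prob, eps=1e-12):
    """Binary cross entropy for a single example."""
    return -(y_true * math.log(y_prob + eps)
             + (1 - y_true) * math.log(1 - y_prob + eps))

rng = random.Random(0)
branch_a = init_layer(8, 210, rng)   # 210-dimensional AC vector per protein
branch_b = init_layer(8, 210, rng)
fused = init_layer(4, 16, rng)       # 16 = 8 + 8 after concatenation
output = init_layer(1, 4, rng)
x_a = [rng.uniform(-1, 1) for _ in range(210)]
x_b = [rng.uniform(-1, 1) for _ in range(210)]
prob = predict(x_a, x_b, branch_a, branch_b, fused, output)
loss = cross_entropy(1, prob)
```

Keeping one branch per protein means each branch only has to model a single 210-dimensional sequence descriptor before the pair-level layers see the combined representation.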

Results and discussion
In this section, we briefly introduce several widely-used evaluation criteria for performance comparison, and the recommended configuration of experiments. Next, we analyze and discuss the experimental results and compare our results with those of other related work.

Evaluation metrics
To comprehensively compare the performance, six evaluation metrics are employed: accuracy (ACC), precision, recall (sensitivity), specificity, Matthews correlation coefficient (MCC), and F1 score. They are defined as:
$$ACC = \frac{TP + TN}{TP + TN + FP + FN}, \quad Precision = \frac{TP}{TP + FP}, \quad Recall = \frac{TP}{TP + FN}, \quad Specificity = \frac{TN}{TN + FP}$$
$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}, \quad F_1 = \frac{2 \times Precision \times Recall}{Precision + Recall}$$
where true positive (TP) stands for the number of true PPIs which are correctly predicted; false negative (FN) stands for the number of true PPIs which are incorrectly predicted as non-interacting pairs; false positive (FP) is the number of true non-interacting pairs which are predicted as interacting pairs; and true negative (TN) represents the number of true non-interacting pairs which are correctly predicted. MCC is considered the most robust metric of a binary classifier. An MCC equal to 0 represents completely random prediction, whereas 1 means perfect prediction. The F1 score is the harmonic mean of precision and sensitivity, and a larger score indicates better performance. The receiver operating characteristic (ROC) curve is also employed to assess the performance of the prediction model. To summarize the ROC curve in a single quantity, the area under the ROC curve (AUC) is used. AUC ranges from 0 to 1, and the maximum value 1 stands for perfect prediction. For a random guess, the AUC value is close to 0.5.
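The confusion-matrix based metrics can be computed with a short Python helper using their standard definitions. The function name and the example counts are hypothetical, and the sketch does not guard against empty classes, where some denominators would be zero.

```python
import math

def binary_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # also called sensitivity
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * recall / (precision + recall)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {"accuracy": accuracy, "precision": precision, "recall": recall,
            "specificity": specificity, "f1": f1, "mcc": mcc}

# Hypothetical counts for 200 test pairs.
scores = binary_metrics(tp=90, fp=10, tn=80, fn=20)
```

Unlike accuracy, MCC uses all four counts in both numerator and denominator, so it stays informative when the positive and negative classes are predicted with very different error rates.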

Experimental setup
Our approach is implemented on the TensorFlow platform (https://www.tensorflow.org). We firstly constructed the negative datasets using four different strategies. We then encoded the amino acid sequences from the datasets using auto covariance (AC) [21]. After that, we trained two separate neural networks on a graphics processing unit (GPU) based on the feature sets encoded by AC. Finally, we fused these two networks to predict new PPIs.
Deep learning algorithms contain a number of hyper-parameters, which may heavily impact the experimental results. The recommended hyper-parameter configuration of our proposed model is summarized in Table 3.
As to the parameter specification of the comparing methods, we employed grid search to obtain the optimal parameters, which are shown in Table 4. Du et al. [27] provide a hyper-parameter configuration similar to ours, which can be accessed via reference [27].

Contribution of controlling degrees
For the negative dataset generated by NIP-SS, we select the top-m protein pairs with the lowest sequence similarity as the negative PPIs. Among all protein pairs, the similarity between these protein pairs is minimum. However, some proteins have very large degrees, which leads to a bias and an overestimation of prediction results. Therefore, we need to control the degree distribution of the negative dataset and make it approximate that of the positive dataset to guarantee the generalization ability of negative examples. Table 5 reports the degree distribution of proteins in S. cerevisiae, H. sapiens and M. musculus, and Fig. 3 reveals the prediction performance of NIP-SS with and without controlling the degree of proteins related to negative samples. From Table 5, we can see that the maximum degree of proteins in the negative dataset (NIP-SS-NonControl) is 1439, and the proportion of non-interactions with degree larger than 150 is 27.39%, which may lead to a bias. As a result, using this dataset produces a higher accuracy of 97.05%. Compared to NIP-SS-NonControl, the negative dataset constructed by NIP-SS contains more proteins and has a smaller maximum degree. Meanwhile, non-interactions are mainly related to proteins whose degrees are lower than 50. As such, the negative dataset generated by NIP-SS has a better generalization ability and lower bias than that generated by NIP-SS-NonControl. The contribution of controlling the degree of proteins in the negative dataset is also significant on the H. sapiens and M. musculus datasets.
For NIP-RW, if we directly select protein pairs whose corresponding entries are equal to 0 in the updated $W^{(k)}$ to generate the negative dataset, such a dataset brings less predictive information and is not conducive to predicting PPIs, since it contains many proteins with low degrees. Therefore, a sub-matrix $W_{p \times p}$ is employed to control the degree distribution of proteins. In addition, k also affects the degree distribution. Given that, we need to specify suitable input values of p and k. Particularly, we firstly fix k to 3, and then tune p from 500 to 4382 with an interval of 500. Next, we calculate the average repeatability (r), the maximum degree of proteins, the proportion of proteins in different ranges of degrees, and the proportion of non-interactions in each range. We then choose the p that makes the degree distribution of proteins in the negative dataset similar to that of the positive dataset. After that, we adopt the p selected in the first step and tune k within {1, 2, 3, 6, 10, 50, 300, 1000}.
The degree distribution and prediction results on S. cerevisiae are shown in Table 6 and Fig. 4, respectively. From Table 6, we can see that when p ≈ 2000, the degree distribution of the negative dataset is most similar to that of the positive dataset. In addition, from Fig. 4, we can also observe that when p < 2000 (p = 500 and 1000 with k = 3), the accuracy is higher than 95%. This is because the average repeatability is large and leads to a bias. A similar parameter selection strategy is also conducted on the other two datasets. The experimental results and the degree distribution of proteins are shown in Fig. 4 and Tables 6 and 7. According to Table 6, we set p = 700 and p = 300 for H. sapiens and M. musculus, respectively. In addition, according to the right of Fig. 4 and Table 7, we fix k = 3 for H. sapiens and k = 50 for M. musculus. From Table 6, we can also observe that when p is set to n/2 (or n/3, where n is the number of proteins in the positive set), the degree distribution generally approximates well that of the positive dataset.

Results of different negative dataset construction strategies
To investigate the effectiveness of the proposed two strategies for constructing negative dataset, we conduct experiments on three prevalent PPIs datasets, including S. cerevisiae, H. sapiens and M. musculus datasets, and take the performance of PPIs prediction as the comparing index. To avoid over-fitting and data dependency, five-fold cross-validation is adopted. Table 8 reports the average prediction results on these three species using different negative dataset generation strategies.
We can see that for the S. cerevisiae dataset, the model based on the negative dataset generated by NIP-SS gives an average accuracy of 94.34%, precision of 95.62%, recall of 92.96%, specificity of 95.74%, MCC of 88.73%, F1 of 94.27% and AUC of 98.24%. These values are higher than those of the other strategies, which adopt random walk, random pairing, or subcellular localization to generate the negative dataset. These results demonstrate the effectiveness of NIP-SS in generating reliable non-interacting protein pairs for PPI prediction. In addition, the negative dataset constructed by NIP-SS contains more proteins and has a degree distribution similar to that of the positive dataset, which can effectively control the bias of the dataset. The model trained on the negative dataset generated by random pairing yields a very low accuracy of 74.20%. That is because this negative dataset has a high rate of false negatives, and its degree distribution mainly concentrates on proteins with degree smaller than 10. The model based on the negative dataset generated by subcellular localization also yields a good performance, with accuracy of 93.79%, MCC of 87.62%, and AUC of 98.13%. However, compared to the negative dataset generated by NIP-SS, this dataset covers fewer proteins and has a larger proportion of non-interactions in the degree range 50-70, which will produce an over-optimistic estimate of prediction performance.
The model trained on the negative dataset generated by NIP-RW yields an average accuracy of 87.92%, MCC of 75.97% and AUC of 94.23%. These values are lower than those of NIP-SS. That is mainly because the proteins in the negative datasets generated by NIP-SS and NIP-RW have different degrees: 21.05% of the non-interacting protein pairs in the negative dataset generated by NIP-SS fall in the range of degree larger than 50, whereas no non-interacting protein pairs in the negative dataset generated by NIP-RW fall in that range. Another reason is that the random walk process is restricted by the connected positive examples. For the small networks of the H. sapiens and M. musculus datasets, NIP-RW yields good results.
On the H. sapiens and M. musculus datasets, the model based on the negative datasets generated by subcellular localization yields the best prediction accuracy, 93.34% and 91.82%, respectively. However, the negative datasets constructed by subcellular localization have the maximum average repeatability (r) and contain the fewest proteins, which leads to a bias and an overestimated performance. Since the degree distributions of the negative datasets constructed by NIP-SS and NIP-RW are similar, the prediction performance of these two strategies is also similar. The model based on the negative datasets generated by random pairing again gives the lowest performance.
Table 5 The degree distribution of proteins of different datasets on S. cerevisiae, H. sapiens, and M. musculus
To further investigate the effectiveness of our model, which first uses two separate DNNs, we introduce a variant of our model called DNNs-Con. DNNs-Con first concatenates the AC features of the two individual proteins and then feeds them to a single network instead of two separate ones. To check the statistical significance of the difference between our model and DNNs-Con, the pairwise t-test (at 95% significance level) is also used. The experimental results of five-fold cross validation are reported in Table 9. From Table 9, we can observe that the accuracy, MCC, F1, and AUC of our model are 2.61%, 5.22%, 2.68%, and 1.29% higher than those of DNNs-Con, respectively. In addition, we observe that our model converges faster than DNNs-Con during training, because two separate networks can more quickly extract the sequence information contained in each amino acid sequence. These results show that our model (using two separate DNNs instead of a single one) is efficient and effective in predicting PPIs.
Based on the above analysis, we fix p = 2000 and vary k ∈ {1, 2, 3, 6, 10, 50, 300, 1000}. Figure 4 (right panel) reports the results under different values of k. We also calculate the degree distribution at different k, which is listed in Table 7. From the right panel of Fig. 4, we can observe that when k ≥ 6, the result is close to 1. That is because there are more nonzero entries in W^(k) as k increases, which changes the degree distribution of proteins and thus brings in a larger bias. Table 7 shows the degree distribution when p = 2000. Based on these results, we fix k to 3.
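The effect of k on the updated matrix can be illustrated with a small sketch. It is written under assumptions, since the exact update rule of NIP-RW is not restated in this section: the adjacency matrix is row-normalised into a transition matrix P, and the updated matrix is taken to be the sum of P^t for t = 1..k, so that a nonzero entry means one protein can be reached from the other within k steps. The function names and the toy path network are hypothetical.

```python
def matmul(a, b):
    """Plain-Python product of two dense matrices given as lists of rows."""
    inner, cols = len(b), len(b[0])
    return [
        [sum(row[t] * b[t][j] for t in range(inner)) for j in range(cols)]
        for row in a
    ]

def k_step_matrix(adj, k):
    """Sum of the t-step transition matrices for t = 1..k (assumed update rule)."""
    n = len(adj)
    trans = []
    for row in adj:
        total = sum(row)
        trans.append([v / total if total else 0.0 for v in row])
    power = trans
    updated = [row[:] for row in trans]
    for _ in range(k - 1):
        power = matmul(power, trans)
        updated = [
            [updated[i][j] + power[i][j] for j in range(n)] for i in range(n)
        ]
    return updated

def candidate_negatives(adj, k):
    """Protein pairs that are still unconnected in the updated matrix."""
    w = k_step_matrix(adj, k)
    n = len(adj)
    return [
        (i, j)
        for i in range(n)
        for j in range(i + 1, n)
        if w[i][j] == 0.0 and w[j][i] == 0.0
    ]

# Toy PPI network: six proteins connected in a chain 0-1-2-3-4-5.
n_proteins = 6
adj = [[0] * n_proteins for _ in range(n_proteins)]
for i in range(n_proteins - 1):
    adj[i][i + 1] = adj[i + 1][i] = 1

for k in (1, 2, 3, 5):
    print(k, len(candidate_negatives(adj, k)))
```

On this toy chain the number of candidate negative pairs shrinks as k grows, because more entries of the updated matrix become nonzero. This mirrors the trade-off described above: a larger k leaves fewer unconnected pairs to choose from.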

The impact of imbalanced class
In general, the number of negative PPIs has a large impact on prediction performance. To investigate the impact of class imbalance on our two proposed strategies, three H. sapiens datasets are constructed with different numbers of negative samples for NIP-SS and NIP-RW, respectively. The ratios of positive samples (3355 interaction pairs) to negative samples in these three datasets are 1:1, 1:2, and 1:3, respectively. Four metrics, namely sensitivity (SEN), specificity (SPE), area under the receiver operating characteristic curve (AUC), and geometric mean (GM), are used to evaluate the prediction performance. GM is commonly used for class-imbalance learning [60] because it gives a more accurate evaluation on imbalanced data. GM is calculated as GM = √(SEN × SPE). The prediction results are shown in Table 10. From Table 10, we can see that as the number of negative samples increases, the overall performance of the model shows a downward trend. In particular, the values of AUC and GM decrease significantly: AUC decreases by 10.51% and 8.27% for NIP-SS and NIP-RW, respectively, and GM decreases by 16.63% and 11.27%. Given that, to avoid the performance degradation caused by class imbalance, we adopt the widely used solution of using the same number of negative PPIs as positive samples.
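The GM formula can be checked with a short sketch. The labels below are a hypothetical 1:3 test set, not data from the paper, and the helper name `sen_spe_gm` is introduced only for illustration. The example uses a degenerate classifier to show why GM is more informative than accuracy on imbalanced data.

```python
import math

def sen_spe_gm(y_true, y_pred):
    """Sensitivity, specificity and their geometric mean for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    sen = tp / (tp + fn)
    spe = tn / (tn + fp)
    return sen, spe, math.sqrt(sen * spe)

# Hypothetical 1:3 test set: 4 interacting pairs (1), 12 non-interacting pairs (0).
y_true = [1] * 4 + [0] * 12
# A degenerate classifier that labels every pair as non-interacting.
y_all_negative = [0] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_all_negative)) / len(y_true)
sen, spe, gm = sen_spe_gm(y_true, y_all_negative)
print(f"accuracy={accuracy:.2f} SEN={sen:.2f} SPE={spe:.2f} GM={gm:.2f}")
```

The degenerate classifier reaches 75% accuracy simply by following the class ratio, while its GM is zero because it never recovers a positive pair.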

The impact of different feature descriptors
Table 6 The degree distribution of proteins of different sizes of submatrix W_{p×p} (k = 3) on S. cerevisiae, H. sapiens, and M. musculus
The extracted features can affect the performance of PPI prediction [28]. To investigate the contribution of the auto covariance (AC) [21] descriptor, we separately train DNNs on S. cerevisiae (with the negative dataset constructed by NIP-SS) based on AC [21], CT [20], LD [23], and MCD [25]. Table 11 reports the results of five-fold cross validation. We also use the pairwise t-test (at 95% significance level) to check the statistical significance of the differences between AC and CT, LD, and MCD. From Table 11, we observe that DNNs-AC achieves an average accuracy of 94.25%, precision of 94.7%, recall of 93.75%, specificity of 94.74%, MCC of 88.5%, F1 of 94.22%, and AUC of 98.15%. The performance differences among these descriptors are not significant, but the AC descriptor has the smallest feature dimension. For this reason, we adopt AC to encode amino acid sequences.
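As a reminder of what the AC descriptor computes, the sketch below follows the commonly used auto covariance definition: for each physicochemical property j and each lag, the products of mean-centred property values of residues that are lag positions apart are averaged over the sequence. The property table `TOY_SCALE`, the function name, and the toy sequence are hypothetical placeholders, not the scales or settings used in the paper.

```python
def auto_covariance(props, max_lag):
    """Auto covariance features of one sequence.

    props   : list of per-residue property vectors (n residues, d properties)
    max_lag : largest distance between residues; must be smaller than n
    Returns a flat list of length max_lag * d, ordered by lag, then property.
    """
    n, d = len(props), len(props[0])
    if max_lag >= n:
        raise ValueError("max_lag must be smaller than the sequence length")
    means = [sum(row[j] for row in props) / n for j in range(d)]
    features = []
    for lag in range(1, max_lag + 1):
        for j in range(d):
            total = sum(
                (props[i][j] - means[j]) * (props[i + lag][j] - means[j])
                for i in range(n - lag)
            )
            features.append(total / (n - lag))
    return features

# Hypothetical two-property scale for a two-letter toy alphabet (placeholder values).
TOY_SCALE = {"A": (1.0, 0.5), "C": (-1.0, 0.5)}

sequence = "ACAC"
encoded = [TOY_SCALE[residue] for residue in sequence]
ac_features = auto_covariance(encoded, max_lag=2)
print(ac_features)
```

The feature length depends only on the number of properties and the maximum lag, not on the sequence length, which is why sequences of different lengths map to vectors of the same, comparatively small, dimension.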

Comparison with existing methods
To further study the performance of our model and the contribution of the negative datasets generated by NIP-SS and NIP-RW, we compare our prediction results on S. cerevisiae with those of other competitive methods, including Guo et al. [21], Yang et al. [22], Zhou et al. [23], You et al. [25], and Du et al. [27]. These approaches were introduced in the "Background" section. Table 12 shows the experimental results. Our method yields an average prediction accuracy of 94.34%, precision of 95.62%, recall of 92.96%, MCC of 88.73%, F1 of 94.27%, and AUC of 98.24%. Compared to the other two negative datasets, the negative dataset constructed by NIP-SS covers more proteins, and its degree distribution is close to that of the positive dataset. In addition, we can observe that the compared methods using the negative dataset constructed by NIP-RW also produce good results. However, for a large dataset, the degrees of proteins in the negative dataset generated by NIP-RW are almost always smaller than 50. This is because the degree distribution is restricted by the collected positive examples, and a large network makes the random walk process less controlled. For this reason, NIP-RW is more reliable on the smaller H. sapiens and M. musculus datasets. These results show that the negative datasets constructed by NIP-SS and NIP-RW are rational and can boost the performance of PPI prediction.

Results on independent datasets
Six independent datasets, each of which contains only interacting (or only non-interacting) pairs, are employed as test sets to evaluate the generalization ability and to further assess the practical prediction ability of our model and the rationality of NIP-SS and NIP-RW: Caenorhabditis elegans (4013 interacting pairs), Escherichia coli (6954 interacting pairs), Helicobacter pylori (1420 interacting pairs), Homo sapiens (1412 interacting pairs), Mus musculus (313 interacting pairs), and Mammalian (1937 non-interacting pairs). Three datasets of H. sapiens (3355 positive examples and 3355 negative examples) are constructed; they differ only in the negative samples, which are generated by NIP-SS, NIP-RW, and subcellular localization, respectively. Then, three models with the optimal configuration (provided in the "Evaluation metrics" section) are trained on these three datasets. After that, the six independent datasets are used to test the generalization ability of these models. The prediction results are shown in Table 13. From [...] contribute to a good performance across species. We note that the accuracies on Mammalian using the NIP-SS and NIP-RW strategies are 3.36 and 3.98 times higher than that using subcellular localization (which is only 4.67%). Given that, we can conclude that the negative dataset generated by subcellular localization may produce a bias in predicting PPIs. In other words, the negative example generation strategy based on subcellular localization is inclined to predict a new protein pair as an interaction. To further demonstrate this finding and the advantages of NIP-SS and NIP-RW, we constructed a dataset (named Mammalian-imbalanced) in which the number of negative samples is about 4 times that of positive samples, since the number of non-interacting protein pairs is far greater than the number of interacting pairs in the real world. The negative samples are from the Mammalian dataset (1937 negative samples), while the positive samples are from the M. musculus dataset (313 positive samples). In total, the dataset contains 313 + 1937 = 2250 protein pairs. The prediction results are also shown in Table 13. From Table 13, we can see that the accuracies on the Mammalian-imbalanced dataset using the NIP-SS and NIP-RW strategies are 23.45% and 27.56%, respectively, which are both higher than that using subcellular localization (only 17.75%). These prediction results show that NIP-SS and NIP-RW have good generalization ability and performance in predicting PPIs, and that the subcellular localization strategy leads to a bias in prediction.
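To make the single-class test sets easier to interpret, the sketch below computes the accuracy of a hypothetical classifier that labels every pair as interacting. Only the dataset sizes (313 positive and 1937 negative pairs) come from the text; the classifier and the variable names are illustrative assumptions.

```python
def accuracy(y_true, y_pred):
    """Fraction of pairs whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

N_POS = 313    # M. musculus interacting pairs (from the text)
N_NEG = 1937   # Mammalian non-interacting pairs (from the text)

pos_true = [1] * N_POS
neg_true = [0] * N_NEG

# Hypothetical extreme case: a model that predicts "interaction" for every pair.
acc_pos_only = accuracy(pos_true, [1] * N_POS)
acc_neg_only = accuracy(neg_true, [1] * N_NEG)
acc_mixed = accuracy(pos_true + neg_true, [1] * (N_POS + N_NEG))

# The mixed accuracy is the size-weighted average of the single-class accuracies.
weighted = (N_POS * acc_pos_only + N_NEG * acc_neg_only) / (N_POS + N_NEG)

print(f"positive-only: {acc_pos_only:.2%}")
print(f"negative-only: {acc_neg_only:.2%}")
print(f"mixed:         {acc_mixed:.2%}")
```

On a test set with only one class, accuracy reduces to sensitivity (positive-only) or specificity (negative-only). A model that always predicts an interaction would therefore score 313/2250, about 13.9%, on the mixed set, which gives a reference point for reading the accuracies reported on Mammalian-imbalanced.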

Conclusion and future work
Effective PPI prediction approaches depend on a high-quality negative dataset (non-interacting protein pairs), which contributes to discriminative and accurate prediction. In this paper, we present two novel strategies (NIP-SS and NIP-RW) to generate high-quality negative datasets and to boost the performance of PPI prediction. NIP-SS uses the sequence similarity between proteins to guide the generation of negative examples, whereas NIP-RW utilizes the interaction profiles of proteins to select negative examples. To reduce the bias and enhance the generalization ability of the generated negative dataset, these two strategies separately adjust the degree of the non-interacting proteins so that it approximates that of the positive dataset. Extensive experiments show that NIP-SS is competent on all datasets and maintains a good performance, whereas NIP-RW obtains a good performance only on small datasets (positive samples ≤ 6000) because of the restriction of the random walk. These experiments also indicate that the negative datasets constructed by NIP-SS and NIP-RW can significantly improve the performance of PPI prediction, and that these two strategies work better than the other two widely adopted strategies. In future work, we will fuse multiple types of biological data, including the sequence similarity, functional similarity, and domain similarity of proteins, to generate the negative datasets. In addition, we will investigate more intelligent ways to adjust the degree of non-interacting proteins.