Identifying protein-protein interface via a novel multi-scale local sequence and structural representation

Background: Protein-protein interactions play a key role in a multitude of biological processes, such as signal transduction, de novo drug design, immune responses, and enzymatic activities. Gaining insight into various binding abilities can deepen our understanding of these interactions. It is of great interest to understand how proteins in a complex interact with each other. Many efficient methods have been developed for identifying protein-protein interfaces.

Results: In this paper, we obtain local information on the protein-protein interface through multi-scale local average block and hexagon structure construction. Given a pair of proteins, we use a trained support vector regression (SVR) model to select the best configurations. On Benchmark v4.0, our method achieves an average Irmsd value of 3.28Å and an overall Fnat value of 63%, which improves upon Irmsd of 3.89Å and Fnat of 49% for ZRANK, and Irmsd of 3.99Å and Fnat of 46% for ClusPro. On CAPRI targets, our method achieves an average Irmsd value of 3.45Å and an overall Fnat value of 46%, which improves upon Irmsd of 4.18Å and Fnat of 40% for ZRANK, and Irmsd of 5.12Å and Fnat of 32% for ClusPro. The success rates of our method, FRODOCK 2.0, InterEvDock and SnapDock on Benchmark v4.0 are 41.5%, 29.0%, 29.4% and 37.0%, respectively.

Conclusion: Experiments show that our method performs better than some state-of-the-art methods in terms of the CAPRI evaluation criteria. These results demonstrate that our method is a valuable tool for identifying protein-protein interfaces.


Background
In biological processes, many proteins carry out their specific biological functions through protein-protein interactions, which are relevant to applications such as drug design and functional analysis. Gaining insight into various binding abilities can deepen our understanding of the protein-protein interface. Determination of binding sites is widely applied in molecular biology research. It is of great interest to understand how proteins bind with each other, which helps us understand the energetics and mechanisms of complexes. Building more effective models based on sequence information, structure information and physicochemical characteristics is the key to identifying protein-protein interfaces. There are many efficient techniques for protein-protein interface prediction [1][2][3][4][5][6][7][8][9][10][11]. Some approaches use machine learning and statistical methods to analyze the differences between interface residues and non-interface residues on the surface [12][13][14][15]. ProMate [16] creates a circle around each surface residue, which is used to extract statistical histograms of many features. It then estimates the probability that each circle lies on the interface, and circles with high probability values are clustered to identify binding residues. PPI-Pred [17] generates an interacting patch and a non-interacting patch for each training protein, and extracts several features from these patches to build an SVM model for predicting the interacting patch in each testing protein. PINUP [18] proposes an empirical scoring function that includes interface propensity and a residue conservation score. It calculates the occurrence of each top-scoring spot and thereby predicts the residues on interface spots.

*Correspondence: fguo@tju.edu.cn. † Fei Guo, Quan Zou and Junhai Xu are joint corresponding authors. 1 College of Intelligence and Computing, Tianjin University, Tianjin, People's Republic of China. Full list of author information is available at the end of the article.
Meta-servers combine the strengths of some existing approaches: meta-PPISP [19] combines three prediction servers; metaPPI [20] combines five identification methods. ProBiS [21,22] predicts protein-protein interface by local structure alignment. It compares the information of a testing protein to some binding sites in the known database, for detecting similar structural residues.
Another class of methods checks the possible poses of two subunits; that is, how these subunits may dock. Docking methods based on fast Fourier transformation (FFT) [23], geometric surface matching [24], and intermolecular energy [25] have been proposed. The general approach is to explore all possible poses and use an energy function to identify near-native poses. The problem of exploring all possible poses has been well solved by some methods [26][27][28]. The key issue is to design an energy function that can identify near-native poses, based on various properties and features such as hydrophobic and conserved polar residues at specific locations [29], hydrogen bonds and salt bridges [30], secondary structure composition [31], relative surface area burial and weighted hydrophobicity [32], and force field energy evaluation [33][34][35]. FRODOCK 2.0 [36] presents a user-friendly protein-protein docking server based on an improved version including a complementary knowledge-based potential. InterEvDock [37] is a server for protein docking based on a free rigid-body docking strategy, integrating coevolutionary information. SnapDock [38] is a highly efficient template-based protein-protein docking algorithm, utilizing the PIFACE interface library. CIPS [39] proposes a new pair potential combining interface composition with residue-residue contact preference, screening docking solutions obtained with either all-atom or coarse-grained rigid docking. ZRANK [40,41] combines an atom-based potential (IFACE) with five residue-based potentials for ranking solutions. It provides fast and accurate rescoring of models from ZDOCK. ClusPro [42] develops a fast algorithm for filtering docked conformations with good surface complementarity and ranking them based on their clustering properties.
RosettaDock [43] constructs the energy function by using van der Waals energies, orientation-dependent hydrogen bonding, implicit Gaussian solvation, side-chain rotamer probabilities and a low-weighted electrostatics energy. HADDOCK [44] makes use of the biochemical and biophysical interaction data, such as chemical shift perturbation data resulting from NMR titration experiments.
In this paper, we calculate local information on the protein-protein interface through multi-scale local average block and hexagon structure construction. Given a pair of input proteins, we use the trained support vector regression (SVR) model to select the best protein-protein docking poses. Experiments show that our method achieves better results than some state-of-the-art methods. Here, we use the CAPRI evaluation criteria [45], namely the Irmsd and Fnat values. On Benchmark v4.0 [46], our method has an average Irmsd value of 3.28Å and an overall Fnat value of 63%. On the CAPRI targets, our method has an average Irmsd value of 3.45Å and an overall Fnat value of 46%. The success rate of our method on Benchmark v4.0 is 41.5%. Compared to the existing methods, our method is a valuable tool for identifying protein-protein interfaces.

Methods
We find the relative orientation and position between two subunits, and each relative orientation and position combination is referred to as a configuration or pose. Given a configuration, we can determine the interface region between two subunits and fix the orientation as well as position of the regions far from the interface.
Here, we utilize our previous enumeration method [47] to identify the docking configurations of two subunits. It performs a large number of rigid transformations to enumerate the poses. Then, we design a novel energy function and build a trained SVR model to evaluate docking poses and select the top-ranking poses with lowest energy values. The flowchart is shown in Fig. 1.
In this paper, our main work is to obtain local information on the protein-protein interface for energy evaluation. First, each pair of proteins is encoded with physicochemical properties and a position specific scoring matrix. Then, we establish two novel models, multi-scale local average block and hexagon structure construction, for representing local sequence and structural information on protein-protein interfaces. Finally, the proposed properties are applied, alongside existing energy items, to identify docking poses.

Physicochemical property
We use six physicochemical properties [48,49] to extract protein features, so that each protein can be represented by a vector of physicochemical property values. These properties are hydrophobicity (H), volume of side chains of amino acids (VSC), polarity (P1), polarizability (P2), solvent-accessible surface area (SASA) and net charge index of side chains (NCISC). The physicochemical property values are listed in Table 1. They are normalized to zero mean and unit standard deviation (SD) as follows:

$$P'_{i,j} = \frac{P_{i,j} - \bar{P}_j}{S_j}$$

where $P_{i,j}$ is the value of physicochemical property $j$ for amino acid type $i$, $\bar{P}_j$ is the mean of physicochemical property $j$ over the 20 amino acid types, and $S_j$ is the corresponding standard deviation of physicochemical property $j$.
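The zero-mean, unit-SD normalization described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `zscore_columns` is hypothetical, the input values are random placeholders rather than the entries of Table 1, and the use of the population standard deviation (`ddof=0`) is an assumption, since the text does not say which estimator is used.

```python
import numpy as np

def zscore_columns(P):
    """Normalize each property column to zero mean and unit SD.

    P has shape (20, 6): 20 amino acid types x 6 physicochemical properties.
    """
    P = np.asarray(P, dtype=float)
    mean = P.mean(axis=0)   # mean of property j over the 20 amino acid types
    sd = P.std(axis=0)      # population SD (ddof=0); an assumption
    return (P - mean) / sd

# Placeholder values only; the real entries would come from Table 1.
rng = np.random.default_rng(0)
P = rng.normal(loc=5.0, scale=2.0, size=(20, 6))
P_norm = zscore_columns(P)
```

After normalization every property contributes on the same scale, which matters for kernel-based regressors that are sensitive to feature magnitudes.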

Position specific scoring matrix
Protein evolutionary information can be described by the Position Specific Scoring Matrix (PSSM) generated by PSI-BLAST [50]. Given a protein, the PSSM information is stored in an L×20 matrix (protein length: L; amino acid types: 20), calculated as follows:

$$PSSM(i,j) = \sum_{k=1}^{20} \omega(i,k) \times D(k,j)$$

where $\omega(i,k)$ is the frequency of amino acid type $k$ at position $i$, and $D(k,j)$ is the value of Dayhoff's mutation matrix (substitution matrix) [51] between amino acid types $k$ and $j$. The PSSM elements are normalized to the range [0, 1] using min-max normalization as follows:

$$PSSM'(i,j) = \frac{PSSM(i,j) - PSSM_{min}}{PSSM_{max} - PSSM_{min}}$$

where $PSSM_{max}$ and $PSSM_{min}$ represent the maximal and minimal elements of the PSSM.
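As a rough sketch of the two steps described above (the weighted sum over the substitution matrix, followed by min-max scaling over all matrix elements), assuming the frequencies and the substitution matrix are already available as arrays. The function names are hypothetical and the inputs are random placeholders; in practice the PSSM would normally be read from PSI-BLAST output rather than recomputed.

```python
import numpy as np

def pssm_from_frequencies(omega, D):
    """omega: (L, 20) frequency of amino acid type k at position i.
    D: (20, 20) substitution matrix values between types k and j.
    Entry (i, j) of the result is sum_k omega[i, k] * D[k, j]."""
    return np.asarray(omega, dtype=float) @ np.asarray(D, dtype=float)

def min_max_normalize(pssm):
    """Scale all elements into [0, 1] using the global minimum and maximum."""
    pssm = np.asarray(pssm, dtype=float)
    lo, hi = pssm.min(), pssm.max()
    if hi == lo:                      # constant matrix: avoid division by zero
        return np.zeros_like(pssm)
    return (pssm - lo) / (hi - lo)

# Placeholder inputs for a protein of length 7 (not real data).
rng = np.random.default_rng(1)
omega = rng.random((7, 20))
omega /= omega.sum(axis=1, keepdims=True)         # rows sum to 1
D = rng.integers(-5, 6, size=(20, 20)).astype(float)

pssm = pssm_from_frequencies(omega, D)
pssm_norm = min_max_normalize(pssm)
```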

Multi-scale local average block
We use the Multi-scale Local Average Block (MLAB) algorithm to extract the conserved information of local regions. The original Average Block (AB) algorithm was proposed by Jeong et al. [52]. Unlike the original AB algorithm, we use multiple block sizes to split the matrix horizontally. The MLAB features describe the local relationship between the target residue and its neighboring residues. Given a residue $R$, let $R_{-1}, R_{-2}, ..., R_{-5}$ be the five residues before $R$ in the sequence, and $R_{+1}, R_{+2}, ..., R_{+5}$ be the five residues after $R$ in the sequence. Then $R_{\pm 1}, R_{\pm 2}, ..., R_{\pm 5}$ are referred to as the ten sequential neighbors. We split the information of the target residue into six local sequential regions with varying composition, via a global zone (A), a bisection (B and C) and a trichotomy (D, E and F). These local regions can describe multiple overlapping continuous and discontinuous interaction patterns, as shown in Fig. 2. We calculate the mean of each local block as follows:

$$L(k,j) = \frac{1}{B^L_k} \sum_{i=1}^{B^L_k} M^L_k(i,j)$$

where $L(k,j)$ is the mean of the $k$-th block in column $j$, $B^L_k$ is the total number of rows in block $k$, and $M^L_k(i,j)$ is the value of the cell in the $i$-th row and $j$-th column of block $k$.
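The block averaging described above can be illustrated with a short sketch. It assumes the window is a matrix whose rows are the target residue and its sequential neighbors and whose columns are per-residue features, and it assumes contiguous, near-equal splits for the bisection and trichotomy. The exact block composition is defined by Fig. 2 and may differ; `mlab_features` is a hypothetical name.

```python
import numpy as np

def mlab_features(window):
    """Column-wise means of six blocks of a local window.

    window: (n_rows, n_cols) array, one row per residue in the window.
    Blocks: A = all rows; B, C = two halves; D, E, F = three thirds.
    Contiguous near-equal splits are an assumption of this sketch.
    """
    window = np.asarray(window, dtype=float)
    blocks = [window]                               # A: global zone
    blocks += np.array_split(window, 2, axis=0)     # B, C: bisection
    blocks += np.array_split(window, 3, axis=0)     # D, E, F: trichotomy
    return np.vstack([b.mean(axis=0) for b in blocks])   # shape (6, n_cols)

# Toy window: 11 residues (target plus ten sequential neighbors), 2 features.
window = np.arange(22, dtype=float).reshape(11, 2)
features = mlab_features(window)
```

With 20 PSSM columns per residue, the same call would yield a 6×20 block-mean matrix per target residue, which can be flattened into a feature vector.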

Hexagon structure construction
We build a hexagon structure for each target residue to describe its neighborhood information, as shown in Fig. 3. We take C$\alpha$ as the origin, place C$\beta$ along the positive direction of the y-axis, and place N on the x-y plane with positive x. The 3D space is partitioned around the y-axis into six equal subspaces by three planes, and the angle between any two planes is 60°. Given a residue $R$, in each subspace we locate the non-local C$\alpha$ nearest to the C$\alpha$ of residue $R$ within a certain distance. Here, a residue is non-local to residue $R$ if and only if it is separated from residue $R$ by at least three residues in sequence. We call these six residues the spatial neighbors of residue $R$, denoted $H^1_R, H^2_R, ..., H^6_R$. We split the hexagon structure of the target residue into six local spatial regions with varying composition, via a global zone (A), a bisection (B and C) and a trichotomy (D, E and F). We calculate the mean of each local space as follows:

$$H(k,j) = \frac{1}{B^H_k} \sum_{i=1}^{B^H_k} M^H_k(i,j)$$

where $H(k,j)$ is the mean of the $k$-th space in column $j$, $B^H_k$ is the total number of rows in space $k$, and $M^H_k(i,j)$ is the value of the cell in the $i$-th row and $j$-th column of space $k$.
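One possible reading of the spatial-neighbor search described above is sketched below. Several details are assumptions, because the text leaves them open: the distance cutoff (10 Å here), the position of the first sector boundary (taken to lie on the x-y plane), the interpretation of "separated by at least three residues" as at least three intervening residues, and a single chain with consecutive residue indices. The function names are hypothetical.

```python
import numpy as np

def local_frame(n, ca, cb):
    """Axes of the residue frame: y along CA->CB, N in the x-y plane with x > 0."""
    y = cb - ca
    y = y / np.linalg.norm(y)
    v = n - ca
    x = v - np.dot(v, y) * y          # remove the component along y
    x = x / np.linalg.norm(x)
    z = np.cross(x, y)                # completes a right-handed frame
    return x, y, z

def spatial_neighbors(idx, ca_coords, n, cb, cutoff=10.0, min_gap=3):
    """For each of six 60-degree sectors around the y-axis, return the index
    of the nearest non-local CA within `cutoff`, or None if the sector is empty."""
    ca = ca_coords[idx]
    x, _, z = local_frame(n, ca, cb)
    best = [None] * 6
    best_dist = [np.inf] * 6
    for j, p in enumerate(ca_coords):
        if abs(j - idx) <= min_gap:   # local in sequence (assumed reading)
            continue
        d = p - ca
        dist = np.linalg.norm(d)
        if dist > cutoff:
            continue
        # Angle of the projection onto the x-z plane, measured around y.
        theta = np.arctan2(np.dot(d, z), np.dot(d, x)) % (2 * np.pi)
        sector = min(int(theta // (np.pi / 3)), 5)
        if dist < best_dist[sector]:
            best_dist[sector], best[sector] = dist, j
    return best
```

The returned indices play the role of the six spatial neighbors; their feature rows can then be averaged per block in the same way as the sequential window.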

Extracting interface residues
The features proposed above can be applied, alongside existing energy items, to extract protein-protein interface residues and identify docking poses. The energy items are listed as follows:
• amino acid contact energy - amino acid probabilities of interface residues [53].
• secondary structure contact energy - secondary structure probabilities of interface residues [53].
• structural neighborhood energy - probability of structural neighboring property on interface [54].
• dihedral angle energy - statistical analysis of dihedral angle correlation on interface [55].
• π-π interaction energy - geometrical property on π-π interaction [55].
• multi-scale local average block on protein 1D sequence.
• hexagon structure construction on protein 3D structure.
We use a trained support vector regression (SVR) model to rank docking poses, and then report the top-ranking poses with the lowest energy values [56][57][58]. For the training set, we use Irmsd (the rmsd value between predicted interfaces and native complexes) as the response value for all configurations of each pair of proteins, and the above energy items are regarded as seven groups of features for each pose. On the testing set, the configurations with the lowest predicted response values are reported as the final result. For a given pair of proteins, we use the trained SVR model to select the top 10 predictions with the lowest energy values.
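The ranking step described above can be sketched with scikit-learn's `SVR`. This is an illustration under stated assumptions rather than the authors' pipeline: the kernel and hyperparameters are placeholders (the text does not specify them), each of the seven energy items is reduced to a single column for brevity, and the training data are synthetic.

```python
import numpy as np
from sklearn.svm import SVR

def train_pose_scorer(X_train, irmsd_train):
    """Fit a regressor that predicts Irmsd from per-pose energy features."""
    model = SVR(kernel="rbf", C=1.0, epsilon=0.1)   # placeholder settings
    model.fit(X_train, irmsd_train)
    return model

def select_top_poses(model, X_poses, k=10):
    """Return indices and predicted values of the k poses with lowest prediction."""
    predicted = model.predict(X_poses)
    order = np.argsort(predicted)[:k]
    return order, predicted[order]

# Synthetic stand-in data: 200 training poses and 50 candidate poses,
# 7 feature columns (one per energy item; a simplification).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 7))
irmsd_train = np.abs(X_train @ rng.normal(size=7) + rng.normal(scale=0.1, size=200))
X_poses = rng.normal(size=(50, 7))

model = train_pose_scorer(X_train, irmsd_train)
top_idx, top_scores = select_top_poses(model, X_poses, k=10)
```

Because the response is Irmsd, a lower predicted value means a pose expected to be closer to the native interface, so sorting in ascending order yields the top-ranked poses.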

Results
In this section, we compare our method to many existing methods for identifying protein-protein interfaces. Experiments show that our method performs better than some state-of-the-art methods on Benchmark v4.0 and the CAPRI targets, based on the prediction quality improved in terms of CAPRI evaluation criteria.

Evaluation criteria
A complex may contain several subunits and multiple binding interfaces. Each binding interface in a complex occurs between a pair of subunits. Two residues, one from each subunit of a pair, are called interface residues if any two atoms, one from each residue, interact. Two atoms are considered to interact if the distance between them is less than 6Å.
According to the CAPRI evaluation criteria [45], three evaluation measures are commonly used in protein-protein interface prediction. A pair of residues on different sides of the interface is considered to be in contact if any of their atoms are within 6Å. The first measure is the fraction of native contacts, Fnat, defined as the number of correct residue-residue contacts in the predicted configuration divided by the number of contacts in the native complex. The second is the fraction of non-native contacts, Fnon-nat, defined as the number of incorrect residue-residue contacts in the predicted configuration divided by the total number of contacts in that predicted pose. The third is the root-mean-square deviation of the interface, Irmsd, defined as the rmsd value between all backbone atoms of the interfaces in the predicted pose and in the native complex, after the two interfaces are superimposed.
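The two contact-based measures defined above translate directly into set operations. The sketch below assumes each subunit is given as a mapping from a residue identifier to an array of atom coordinates; this data layout and the function names are illustrative assumptions, and Irmsd is omitted because it additionally requires a superposition step.

```python
import numpy as np

def residue_contacts(atoms_a, atoms_b, cutoff=6.0):
    """Set of (residue_a, residue_b) pairs with any inter-atom distance < cutoff.

    atoms_a, atoms_b: dict mapping residue id -> (n_atoms, 3) coordinate array.
    """
    contacts = set()
    for res_a, xyz_a in atoms_a.items():
        for res_b, xyz_b in atoms_b.items():
            # All pairwise distances between atoms of the two residues.
            dists = np.linalg.norm(xyz_a[:, None, :] - xyz_b[None, :, :], axis=-1)
            if dists.min() < cutoff:
                contacts.add((res_a, res_b))
    return contacts

def fnat_fnonnat(predicted, native):
    """Fnat = correct / native contacts; Fnon-nat = incorrect / predicted contacts."""
    if not native:
        raise ValueError("native contact set is empty")
    fnat = len(predicted & native) / len(native)
    fnonnat = len(predicted - native) / len(predicted) if predicted else 0.0
    return fnat, fnonnat

# Toy example with hand-written contact sets.
native = {(1, 1), (2, 2), (3, 3), (4, 4)}
predicted = {(1, 1), (2, 2), (5, 5)}
fnat, fnonnat = fnat_fnonnat(predicted, native)
```

In the toy example two of the four native contacts are recovered and one of the three predicted contacts is not native.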

Statistical analysis
We analyze different regression models and evaluate the performance of energy items on CAPRI [45]. CAPRI is a community-wide experiment to assess the capacity of docking methods.

Assessment of regression model
To assess the effectiveness of the regression model, we compare the performance of Support Vector Regression [59] and Linear Regression [60] with the same energy items on CAPRI; the results are shown in Fig. 4. The average Irmsd value for cases by Support Vector Regression is 3.45Å. The average Irmsd value for cases by Linear Regression is 3.57Å. This supports our hypothesis that Support Vector Regression can accurately identify the protein-protein interface.

Assessment of energy items
To assess the effectiveness of the energy items, we analyze the performance of different cases on CAPRI. We reevaluate the configurations selected by different energy items; the results are shown in Fig. 5. The average Irmsd value for cases with sequence contact energy (amino acid contact energy, secondary structure contact energy) is 3.63Å. The average Irmsd value for cases with structural interaction energy (structural neighborhood energy, dihedral angle energy, π-π interaction energy) is 3.57Å. The average Irmsd value for cases with multi-scale local energy (multi-scale local average block on protein 1D sequence, hexagon structure construction on protein 3D structure) is 3.51Å. The average Irmsd values for these cases are all higher than that for cases with all energy items (3.45Å), so combining all energy items gives the best result. It confirms our hypothesis that the multi-scale local representations

Docking validation
We evaluate the performance of our method on the protein-protein complexes in Benchmark v4.0 [46]. All targets in Benchmark v4.0 are classified into three categories according to the magnitude of conformational change after binding: rigid-body (easy) cases, medium difficulty cases and difficult cases. Our method is compared to SnapDock [38], InterEvDock [37] and FRODOCK 2.0 [36]. The success rate is the percentage of cases for which at least one of the top 10 predictions is an acceptable or better solution according to the CAPRI criteria. The protein-protein docking results of the different methods are shown in Table 2. The success rates of our method, FRODOCK 2.0, InterEvDock and SnapDock on Benchmark v4.0 are 41.5%, 29.0%, 29.4% and 37.0%, respectively. Our method improves the success rate by at least 4.5 percentage points.
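The success rate defined above can be computed as in the following sketch, assuming each case is represented by the ranked list of CAPRI quality labels of its predictions. The label strings and the function name are illustrative assumptions, and the example cases are made up.

```python
def success_rate(cases, top_n=10, good=("acceptable", "medium", "high")):
    """Percentage of cases with at least one acceptable-or-better
    prediction among the first `top_n` ranked predictions."""
    if not cases:
        raise ValueError("no cases given")
    hits = sum(any(label in good for label in ranked[:top_n]) for ranked in cases)
    return 100.0 * hits / len(cases)

# Made-up example: four cases, each a ranked list of quality labels.
cases = [
    ["incorrect", "acceptable"],        # hit at rank 2
    ["incorrect"] * 10,                 # no hit
    ["high"],                           # hit at rank 1
    ["incorrect"] * 10 + ["high"],      # hit only at rank 11: not counted
]
rate = success_rate(cases)
```

Only the first ten predictions of each case count, so a correct pose ranked eleventh does not contribute to the success rate.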

Protein-protein interface prediction
In this study, we compare our predicted interfaces with those of ZRANK [40,41] and FiberDock (external tool) [28], and also with those of ClusPro [42]. We use 79 complexes from Dockground [61] as the training set. To avoid over-fitting, we exclude complexes sharing more than 30% identity with cases in the testing set. The average Irmsd value is 1.49Å, and the overall Fnat and Fnon-nat values are 85% and 16%.

Evaluation on benchmark v4.0
On Benchmark v4.0, our method achieves an average Irmsd value of 3.28Å and an overall Fnat value of 63%, which improves upon Irmsd of 3.89Å and Fnat of 49% for ZRANK, and Irmsd of 3.99Å and Fnat of 46% for ClusPro. The results are shown in Table 3. The complexes are classified into three categories according to the magnitude of conformational change after binding. In the rigid-body group, our method achieves an average Irmsd value of 2.86Å and an overall Fnat value of 69%, which improves upon Irmsd of 3.31Å and Fnat of 56% for ZRANK, and Irmsd of 3.33Å and Fnat of 55% for ClusPro. In the medium difficulty group, our method achieves an average Irmsd value of 3.35Å and an overall Fnat value of 59%, which improves upon Irmsd of 4.46Å and Fnat of 39% for ZRANK, and Irmsd of 4.71Å and Fnat of 30% for ClusPro. In the difficult group, our method achieves an average Irmsd value of 5.39Å and an overall Fnat value of 36%, which improves upon Irmsd of 6.18Å and Fnat of 28% for ZRANK, and Irmsd of 6.53Å and Fnat of 21% for ClusPro.

Evaluation on CAPRI
We evaluate protein-protein interface prediction by our method, ZRANK and ClusPro on CAPRI. On 35 CAPRI targets, our method achieves an average Irmsd value of 3.45Å and an overall Fnat value of 46%, which improves upon Irmsd of 4.18Å and Fnat of 40% for ZRANK, and Irmsd of 5.12Å and Fnat of 32% for ClusPro. Our method predicts 9 incorrect, 12 acceptable, 12 medium and 2 high quality results. ZRANK+FiberDock predicts 14 incorrect, 7 acceptable, 7 medium and 7 high quality results. ClusPro predicts 13 incorrect, 11 acceptable, 8 medium and 3 high quality results.

Binding sites identification
Some existing methods use machine learning and statistical approaches to predict binding sites. Each comparison with an existing method is performed on the test data used by that method in the literature.

Comparison to metaPPI, meta-PPISP and PPI-Pred
In this experiment, the test data of metaPPI [20] is used to predict binding sites. The data consists of 41 complexes, divided into two categories: enzyme-inhibitor (EI) and others. The overall Fnat and Fnon-nat values for each prediction method are shown in Table 4. The overall Fnat values for our method, metaPPI, meta-PPISP and PPI-Pred are 62%, 28%, 38% and 38%, respectively. The overall Fnon-nat values for these four methods are 34%, 51%, 54% and 64%, respectively. Our method improves the overall Fnat value by at least 24 percentage points. The average numbers of predicted interface residues for our method, metaPPI, meta-PPISP and PPI-Pred are 22.1, 13.2, 18.2 and 27.8, while the average number of actual interface residues is 22.7. The average numbers of correctly predicted residues for these four methods are 12.9, 5.5, 7.5 and 8.2.

Comparison to ProMate and PINUP
Our method is compared to ProMate and PINUP. The test data was originally used by ProMate [16] and includes 57 unbound proteins and their complexes. The results are reported in Table 5. The overall Fnat values for our method, PINUP and ProMate are 60%, 42% and 13%, respectively. The overall Fnon-nat values for these three methods are 45%, 55% and 47%, respectively. Our method improves the overall Fnat value by at least 19%. The average numbers of predicted interface residues for our method, PINUP and ProMate are 25.6, 19.0 and 5.4, while the average number of actual interface residues is 22.6. The average numbers of correctly predicted residues for these three methods are 12.6, 8.3 and 2.7.

Case study
We evaluate interface prediction of our method on two different cases.

Interface prediction on SK/RR interaction
We study HisKA domain of sensor histidine kinase (PF00512) and its partner response regulator domain (PF00072) in Pfam database [62]. Interface identification

Discussion
Many protein-protein interface identification approaches are based on analyzing different features, such as sequence and structural properties, as well as other physicochemical properties. Most of these features describe only the properties of the currently interacting residues and do not represent the real situation well, so they are insufficient to predict interface residues with high accuracy. Although many computational methods have been used to predict protein-protein interfaces, the effectiveness and robustness of previous prediction models can still be improved. The main improvements of our proposed method come from adopting effective feature extraction models that capture useful protein information.
All results demonstrate that our method is a valuable technological tool for identifying protein-protein interface.

Conclusions
We identify two new features: multi-scale local average block and hexagon structure construction. Given a pair of proteins, we use the trained SVR model to select the best poses. In our experiments, the prediction ability of our method is better than that of other existing state-of-the-art approaches. This demonstrates that our proposed method is a promising and useful support tool for future proteomics research. In future work, we will extend our method to predict important special complexes.

Fig. 6 Our method detects the binding residues on SK/RR interaction. Interface residues are shown in red boxes and non-interface residues are shown in black boxes

Fig. 7 Our method detects the binding residues on Spirulina platensis. Interface residues are shown in red boxes and non-interface residues are shown in black boxes