Exploring the potential of 3D Zernike descriptors and SVM for protein–protein interface prediction

Daberdaku, Sebastian; Ferrari, Carlo

doi:10.1186/s12859-018-2043-3

Research Article
Open access
Published: 06 February 2018

Exploring the potential of 3D Zernike descriptors and SVM for protein–protein interface prediction

BMC Bioinformatics volume 19, Article number: 35 (2018) Cite this article

5698 Accesses
39 Citations
Metrics details

Abstract

Background

The correct determination of protein–protein interaction interfaces is important for understanding disease mechanisms and for rational drug design. To date, several computational methods for the prediction of protein interfaces have been developed, but the interface prediction problem is still not fully understood. Experimental evidence suggests that the location of binding sites is imprinted in the protein structure, but there are major differences among the interfaces of the various protein types: the characterising properties can vary a lot depending on the interaction type and function. The selection of an optimal set of features characterising the protein interface and the development of an effective method to represent and capture the complex protein recognition patterns are of paramount importance for this task.

Results

In this work we investigate the potential of a novel local surface descriptor based on 3D Zernike moments for the interface prediction task. Descriptors invariant to roto-translations are extracted from circular patches of the protein surface enriched with physico-chemical properties from the HQI8 amino acid index set, and are used as samples for a binary classification problem. Support Vector Machines are used as a classifier to distinguish interface local surface patches from non-interface ones. The proposed method was validated on 16 classes of proteins extracted from the Protein–Protein Docking Benchmark 5.0 and compared to other state-of-the-art protein interface predictors (SPPIDER, PrISE and NPS-HomPPI).

Conclusions

The 3D Zernike descriptors are able to capture the similarity among patterns of physico-chemical and biochemical properties mapped on the protein surface arising from the various spatial arrangements of the underlying residues, and their usage can be easily extended to other sets of amino acid properties. The results suggest that the choice of a proper set of features characterising the protein interface is crucial for the interface prediction task, and that optimality strongly depends on the class of proteins whose interface we want to characterise. We postulate that different protein classes should be treated separately and that it is necessary to identify an optimal set of features for each protein class.

Background

Proteins carry out a broad range of functions in living organisms such as structural support, signal transmission, immune defence, transport, storage, biochemical reaction catalysis and motility processes. The majority of proteins does not act in isolation: in fact they express their biological roles by interacting with other molecules [1]. Protein–protein interactions (PPIs) are of particular interest as they tell us how proteins come together to construct metabolic and signalling pathways in order to fulfil their functions [2]. Dysfunction or malfunction of pathways and alterations in protein interactions have shown to be the cause of several diseases such as neurodegenerative disorders [3] and cancer [4], and hence the identification of the exact location on a protein’s surface where it is likely to bind to its partners, i.e. the binding interface, has become one of the most popular targets for rational drug design [5]. In addition to practical applications, reliable identification of protein–protein interfaces is an important goal for basic research on the mechanisms of macromolecular recognition. For instance, PPI interface predictions can greatly aid protein–protein docking algorithms by being used in scoring functions or to constrain the available search space [6–8].

There are several experimental techniques available which can be employed for the characterisation of protein–protein interfaces at residual and even atomic level. For instance, both X-ray crystallography [9, 10] and nuclear magnetic resonance (NMR) spectroscopy [11] have been used to determine protein interfaces at atomic level. Cryo-electron microscopy [12] has increasingly gained popularity as it allows the examination of native structural features of hydrated molecules in solution. Other techniques provide structural elucidation of interactions at lower resolutions. Alanine scanning mutagenesis [13], Hydrogen/Deuterium exchange [14] and chemical cross-linking [15] have been used to experimentally characterize protein–protein interfaces at residue level.

Although impressive progress has been made, there are several limitations to the existing experimental methods in the determination of protein–protein interfaces. X-ray crystallography requires crystallizing the specimen and placing them in non-physiological environments, which can be inherently difficult and occasionally lead to functionally-irrelevant conformational changes. NMR spectroscopy is suitable for macromolecules in solution (closer to real functional environments or foldings) and can yield information on the dynamics of various parts of a given the protein or complex, but its applicability is limited to small polypeptides (less than 50 kDa). Cryo-electron microscopy has no sample size constraints and can guarantee a reduced radiation damage to the sample compared to X-ray crystallography, but is generally more difficult, time consuming, and requires operating constantly at temperatures lower than –135°C. These technical challenges make such experiments both labour-intensive and time-consuming, while on the other hand, the ongoing proteomics and structural genomics research continues producing large amounts of data, which need to be interpreted in a timely manner. Efficient computational methods are therefore needed to correctly predict the potential binding sites for a deeper understanding of PPIs.

Several computational methods for the prediction of PPI sites are available to date [16] which can be roughly categorised into sequence-based and structure-based approaches [17, 18]. In sequence-based methods, a sliding window of fixed length (typically varying from 3 to 30 residues) is scanned across the protein sequence and a number of overlapping local sequence segments are extracted. For each of these segments, a feature vector is constructed using various amino acid properties (physicochemical, statistical and structural features), and is used as the input of a classification problem. These methods are particularly useful as they allow the PPI site prediction when a protein’s structure information is not yet available.

In [19], a two-stage classifier is employed consisting of a Support Vector Machine (SVM) and a Bayesian network classifier that identifies interface residues primarily on the basis of sequence information. A 9-residue-long sliding window is employed, which is encoded using a 20 bit per residue feature vector (180 bit) for the first stage, and a 1 bit per residue (excluding the central one) feature vector (8 bit) for the second stage. In [20], a sliding window approach is combined with a Random Forests classifier to predict protein interaction sites using sequence information, both alone and in combination with structure-derived parameters. The input feature vectors were derived using a window length of 9 residues and employing 17 features per residue. Murakami and Mizuguchi predict interaction sites in protein sequences with a Naïve Bayes classifier using sequence features only: a position-specific scoring matrix (PSSM) and the predicted accessibility [21]. In [22], 24 independent neural network models are built using sparsely encoded sequence features for each amino acid (20-dimensional binary encoding for each residue) and a PSSM, and the average score of the 24 predictors is returned as the final score. Sriwastava et al. employ 21-residue-long local sequence segment pairs of protein sequences to identify interaction sites in protein complexes [23]. The input samples are built by assigning 8 properties to each residue in the local sequence segment pair, yielding 2×21×8=336-dimensional feature vectors classified by an SVM. In [24], a wide range of features (physicochemical properties, evolutionary conservation, amino acid distances and a PSSM) is extracted from protein sequences without using any structure data, then, a random forest-based integrative model is employed to effectively utilize these features and to deal with imbalanced data. Garcia-Garcia et al. propose a sequence-based computational method that infers possible interacting regions between two proteins by searching minimal common sequence fragments of the interacting protein pairs [25]. A two-dimensional matrix is derived by computing a score for each pair of residues that relates to the presence of similar regions in interolog protein pairs. The potential interface regions are reflected in query proteins by representing the scoring matrix as a heat map.

Structural features associated with the atomic coordinates of proteins are important discriminative attributes for PPI interface prediction, and the absence of such information is therefore expected to reduce the performance of sequence-based predictors compared to structure-based ones. For instance, most interface residues are also located on the protein surface, so structure-based methods can simply identify surface residues and ignore all internal residues. PPI interfaces are comprised of residues that can be located close to each-other in 3D space, while having distant positions in the primary sequence of the proteins. Finally, geometrical complementarity can be evaluated from 3D structures. Structure-based computational approaches offer several advantages over sequence-based ones, but are limited by the availability of protein 3D structures. However, the number and quality of available protein 3D structures has been steadily increasing over the past years and several structural repositories are available to date (i.e. Protein Data Bank (PDB) [26], The PeptideAtlas Project [27], Global Proteome Machine Database (GPMD) [28], The Proteomics Identifications database (PRIDE) [29]), enabling the development of structure-based interface predictors. Currently, most structure-based machine learning interface predictors exhibit better performance than sequence-based methods [16].

Porollo and Meller use “fingerprints” derived from the difference between the predicted and actual relative accessible surface area (rASA) of residues as features for interface prediction [30]. The prediction of PPI sites is done by a consensus method that combines the output of 10 Neural Networks with majority voting. Kufareva et al. developed an alignment-independent method of PPI interface prediction from local statistical properties of the protein surface at the atomic-group level [31]. The classification is done using a partial least-squares regression algorithm on the solvent accessibility values of 12 significantly over-represented and under-represented atomic groups at the interface, and can be further complemented by evolutionary conservation scores. In [32], interface regions for a query protein are determined by clustering and ranking the known interfaces in structural homologs. Zhang et al. propose a structural homology-based PPI interface prediction method [33]. For each query protein, its structural neighbours are identified by structural alignment, and their interface is mapped onto the query protein structure. The frequency of the mapped contacts are calculated for each residue in the query protein, and a logistic function is used to normalize the contact frequencies and generate the final prediction score for each residue. In [34], information from both proteins in a complex is used to predict pairs of interacting residues from the two proteins. Sequence (PSSM and predicted rASA) and structure (rASA, residue depth, half sphere amino acid composition, protrusion index) information about residue pairs is captured through pairwise kernels that are used for training a SVM classifier.

Experimental evidence supports the hypothesis that the location of binding sites is imprinted in the structures of proteins, and that this information can be extracted even without the knowledge of the binding partner [17, 35]. Interface surface portions share common physicochemical properties which distinguish them from the non-interface ones, thus, only specific areas of the protein surface are amenable to be engaged in PPIs. It has been observed that interaction sites are characterised by a high number of hot spots, i.e. energetically critical residues that contribute significantly to the free energy of binding [36]. Clusters of hydrophobic residues [37] and aromatic side chains [38, 39] are more abundant in the binding site, while hydrophilic residues are infrequent. Aromatic residues can form strong hydrophobic interactions between the bulky hydrophobic side chains, and the parallel arrangement of two aromatic rings creates tighter packing with better geometric fit. Cys–Cys residue contacts and the contacts between residues with opposite charges are more frequent in PPI sites [39]. Besides, protein interface regions are less flexible [40] and demonstrate higher sequence conservation rates [38, 41] than other non-binding regions. Conserved interfaces are critical for the maintenance of PPIs throughout evolution.

There are also differences among the interfaces of the various types of PPIs [2]. Depending on the interaction type and its function, the properties that characterise interfaces can vary a lot. For instance, various classes of PPIs differ on the interface propensities of residues [42]. Interfaces of homodimers (complexes made of identical protein chains) are rich in nonpolar and aromatic residues while depleted in polar and charged residues [43], except for Arg which is not excluded in spite of its charge [44]. Interfaces of permanent complexes (i.e. complexes where the constituent proteins remain irreversibly bound after the initial interaction) are more hydrophobic if compared to those of transient complexes (the two proteins can associate and dissociate during their lifetime) [45]. Proteins forming transient complexes should be stable on their own, thus their interfaces are less hydrophobic. The interfaces of obligate complexes (i.e. stable complexes whose constituent proteins do not exhibit well-folded structure when apart) present higher sequence conservation rates [46] and are more hydrophobic [47] than transient complexes. Salt-bridges and hydrogen bonds occur more frequently in the interfaces of transient complexes [2] while covalent disulphide bridges are quite rare, as they can be found in a few, relatively small, permanent complexes [48].

Proteins belonging to the same functional category recognize their interacting partners by certain types of molecular interactions that are specific to their protein family and local environments. As a result, proteins can show specific binding interactions according to their functional classes of PPI interfaces. In [49], basic differences between homodimeric, heterodimeric, protein–antibody and enzyme–inhibitor protein complexes are explored. Cho et al. [50] showed that three functional classes of transient complexes could be distinguished by only four interaction types (NH ⋯ NH, ion–ion, amine–cation and C^α − H ⋯ O = C). Moreover, C^α − H ⋯ O = C interactions were found to be predominant in protease–inhibitor interfaces while ion–ion interactions were found to be specific to signal transduction complexes. In [51], six types of PPI interfaces were studied and significant differences were found in their residue composition and their residue–residue contact preferences, in the interactions between permanent and transient interfaces, and between interactions associating homo-oligomers and hetero-oligomers. Antibody–antigen complexes were found to exhibit quite peculiar binding mechanism, as they do not undergo correlated mutations (the antibody adapts to bind a particular antigen) and their amino acid contact propensities are quite different from those of other protein complexes [52].

Although significant research has been done in the area of protein–protein interactions, the problem of PPI interface prediction is still not fully understood [23]. The selection of an optimal set of biological and physico-chemical features characterising the protein surface is one of the main unresolved issues. There are no known features which can singularly distinguish between interface and non-interface regions of the protein surface, and, the complex, non-linear combinations of features required to describe interaction sites can vary widely from one class of PPIs to another. Moreover, protein interface prediction is an imbalanced classification problem, because the the number of interacting residues of a protein is generally much smaller than that of non-interacting ones. Despite these limitations, several computational methods were reported to achieve good performance in the task of interface prediction for specific protein classes. In [53], Gao et. al. predict interface residues in enzymes with a Random Forest classifier employing the maximum relevance minimum redundancy method followed by incremental feature selection. In [54], a genetic algorithms which searches for known interface 3D templates is used to predict enzyme binding sites. In [55], B-cell epitopes (antigen interface) are predicted from the corresponding protein sequence using a combination of two classifiers, a naïve Bayesian and a random forest classifier, through a voting algorithm. Jespersen et. al. predict B-cell epitopes from antigen sequences with a random forest algorithm trained on the interfaces of known antibody–antigen protein complexes [56]. In [57], paratope (antibody interface) prediction is carried by deriving a set of consensus regions from the structural alignment of known sequentially similar antibodies. In [52], antibody-specific statistics are used to annotate residues with a score indicating their likelihood to belong to the antibody paratope.

In view of the above, we decided to perform binding interface prediction on different classes of proteins in order to gain a better understanding of the various PPI interfaces. In this work we introduce a methodology for the binding interface prediction of proteins given their experimentally-solved 3D structures (PDB files), without any knowledge on their possible binding partners. In order to effectively discriminate between interacting sites and non-interacting sites, we used a set of eight high quality amino acid indices (HQIs) of physico-chemical and biochemical properties extracted from AAindex1 dataset and first introduced in [58]. This set of properties has been employed and validated in several recent publications [23, 59–63]. We mapped these HQIs onto the voxelised representation of the protein surface, obtaining a geometrical representation of the latter enriched with the physico-chemical and biochemical properties of the underlying residues. Spherical patches are then uniformly sampled from the protein surface and, for each patch, a rotationally invariant local descriptor based on 3D Zernike moments is computed. The 3D Zernike descriptors (3DZDs) possess several attractive features such as a compact representation, rotational and translational invariance, and have been shown to adequately capture global and local protein surface shape [64–66] and to naturally represent physico-chemical properties on the molecular surface [67]. 3DZDs are employed to quickly evaluate the shape and physico-chemical similarity of local surface patches, since similar patches have similar descriptors. In order to handle the class imbalance between interface and non-interface local surface patches, we used a combination of undersampling of the majority class and oversampling of the minority class. We employed the stability selection method know as Randomized Logistic Regression as a feature selection algorithm on the 3DZDs in order to reduce the overall number of features. The resulting reduced descriptors were then used as samples for a binary classification problem: Support Vector Machines were used as a classifier to distinguish interface local surface patches (surface patches belonging to the protein–protein interaction interface) from non-interface ones. This is the first time that 3D Zernike descriptors of eight HQIs mapped on the corresponding protein surfaces are employed in the prediction of PPI interfaces. The proposed method was tested and validated on 16 classes of proteins obtained from the Protein–Protein Docking Benchmark 5.0, for both their bound and unbound states and compared to other state-of-the-art protein interface predictors.

Methods

Protein surface representation

In this work we employed the voxelised representation of the Solvent Excluded surface (SES) [68], which can be defined as follows. If we imagine a probe-sphere of radius equal to the size of the solvent molecule as it rolls over the external atoms of the protein, we can define the SES as the union of two surfaces: the portion of the outer atoms’ surface touched by the probe-sphere while it rolls over them, and the inward-facing surface portions of the probe when it touches two or more atoms. The SES represents a continuous functional surface of the molecule, i.e. the surface that is available to interact with. Voxelised surface representations (also known as dot-surfaces or grid-based representations), although simple, are widely appreciated for their accuracy and applicability in various contexts. A voxel (volumetric pixel) represents a single, discrete data point on a regular grid in the 3D space, and can contain multiple values in order to represent various properties of a certain portion of space in a simple and effective way.

The voxelised SES of proteins were computed with the region-growing Euclidean distance transform methodology described in our previous works [69, 70] at a resolution of 64 voxels per Å³, using a 1.4Å radius for the solvent probe. Patch centres are extracted from each protein surface uniformly and at a minimum separation of 1.8Å, while local surface patches are extracted using a sphere with a 6.0Å radius centred at each patch centre. This ensures that there is plenty overlap among patches with neighbouring centres. The 6.0Å patch radius is a recurring value in many algorithms which employ spherical patches [66, 68, 71–73], because it is an approximation of the radius of an amino acid [71]. The 3D Zernike Descriptors used in this work were computed up to a maximal order of 20, which corresponds a vector of 121 invariants per descriptor. 3DZDs of maximal order 20 have been shown to adequately capture shape complementarity at the protein–protein interface [66].

Interfacial regions of the protein surface

The recognition of PPI interface regions can be seen as a classification problem, i.e., each local surface patch is assigned to one of the two classes: interface surface patches, and non-interface surface patches. Consequently, the problem may be solved using statistical and machine learning techniques such as Support Vector Machines. A clear definition of interacting local surface patches is required in order to predict whether a given patch is involved in protein–protein interactions. However, many alternative definitions are being used to define an interaction site based on 3D structural data [74] which can be grouped into two main approaches: (i) inter-atomic distance between non-hydrogen atoms of different protein chains and (ii) change in accessible surface area (ASA) upon complex formation.

In this work, we used the following definition of interface and non-interface local surface patches. Let P₁ and P₂ be two proteins in a given complex whose 3D structure is known, and let SES(P₁) and SES(P₂) be the corresponding voxelised SES representations. The interface $I_{P_{1}}$ of protein P₁ is defined as the set of voxels from SES(P₁) which are within a 4.5Å distance from some heavy atom in P₂, i.e.:

$$ \begin{aligned} I_{P_{1}} = \lbrace \boldsymbol{v} \in SES(P_{1}) \mid & \exists\ \text{atom}\ a \in P_{2} \\ &\text{such that}\ d(\boldsymbol{v}, a) \leq 4.5 \text{\r{A}{}} \rbrace~. \end{aligned} $$

(1)

Equivalently, the interface $I_{P_{2}}$ of protein P₂ is defined as:

$$ \begin{aligned} I_{P_{2}} = \lbrace \boldsymbol{v} \in SES(P_{2}) \mid & \exists\ \text{atom}\ a \in P_{1} \\ &\text{such that}\ d(\boldsymbol{v}, a) \leq 4.5 \text{\r{A}{}} \rbrace \enspace. \end{aligned} $$

(2)

A patch is an interface patch if at least 80% of its surface voxels are located in the current protein’s interface, otherwise the patch is categorised as a non-interface patch.

Residue feature set

In order to reliably predict PPI interface residues, the physico-chemical characteristics (features) that can best discriminate between interacting and non-interacting sites must be identified. The choice of such features is critical for the success of a predictor [16]. The AAindex [75] is a database of numerical indices representing various physicochemical and biochemical properties of residues and residue pairs derived from published literature. An amino acid index is a set of 20 numerical values representing any of the different physicochemical and biological properties of each amino acid: the AAindex1 section of the database is a collection of 566 such indices (Release 9.2, February 2017). By using a consensus fuzzy clustering method on all available indices in the AAindex1, Saha et al. [58] identified three high quality subsets (HQIs) of all available indices (544 at the time), namely HQI8, HQI24 and HQI40. In this work we used the features of the HQI8 amino acid index set (see Table 1) which were identified as follows. Using the correlation coefficient between indices as a distance measure, Saha et al. divided all the available indices in the AAindex1 section into 8 clusters: the elements of the HQI8 subset consist of the medoids (centres) of these clusters.

Table 1 The HQI8 subset of amino acid indices from the AAindex database

Full size table

3D Zernike descriptors

The 3D Zernike descriptors (3DZD) were first used as a representation of the protein surface shape in [64], and have since been employed in several tasks such as global protein structure comparison [65], surface property comparison [67], local surface classification [76], binding ligand prediction by pocket-pocket similarity detection [77–79] and pocket-ligand complementarity evaluation [80, 81], and protein-protein docking prediction [66] with quite satisfactory results. 3DZDs present several advantages over other surface representations. For instance, they can represented protein surfaces and the corresponding properties very compactly as a vector of numbers. 3DZDs are invariant to rotations and translations, i.e. they are not affected by the initial orientation of the molecular surface. Because of this property, time-consuming spatial alignments of proteins are not required and the descriptors can be precomputed and stored. The 3DZDs can be computed for any 3D image, and are thus suitable for representing physico-chemical properties on the molecular surface as the electrostatic potential or the hydrophobicity [67]. Lastly, by changing the order of the series expansion, the resolution of the surface representation can be easily controlled.

Each patch of the enriched protein surface is represented by the 3D Zernike descriptors. The 3DZD are a series expansion of a 3D function which exhibit several desirable properties such as compactness of the representation, roto-translational invariance and minimum information redundancy (orthonormality). In what follows we will provide a brief description of the 3DZD. Refer to [82] for the exhaustive mathematical derivation and to [83] for the implementation details. The 3D Zernike functions $Z_{nl}^{m}$ of order n and repetition m are defined as

$$ Z_{nl}^{m}(r, \theta, \phi)=R_{nl}(r) \cdot Y_{l}^{m}(\theta, \phi) \enspace. $$

(3)

$Y_{l}^{m}(\theta, \phi)$ are the spherical harmonics in polar coordinates of l^th degree, where l ≤ n, m∈ {−l,−l+1,−l+2,…,l−1,l}, with n−l an even number. R_nl(r) are the radial polynomials of radius r which guarantee the orthonormality of the $Z_{nl}^{m}(r, \theta, \phi)$ polynomials in Cartesian coordinates. The expression of $Z_{nl}^{m}$ can be rewritten in Cartesian coordinates as a linear combination of monomials of order up to n:

$$ Z_{nl}^{m}(\boldsymbol{x}) = \sum\limits_{r+s+t \leq n} \chi_{nlm}^{rst} \cdot x^{r} y^{s} z^{t}~. $$

(4)

The 3D Zernike moments $\Omega _{nl}^{m}$ of function $f(\boldsymbol {x}), \boldsymbol {x}\in \mathbb {R}^{3}$ are defined as:

$$ \Omega_{nl}^{m} := \frac{3}{4\pi}\int_{\lvert\boldsymbol{x} \rvert\leq 1} f(\boldsymbol{x})\overline{\boldsymbol{Z}_{nl}^{m}(\boldsymbol{x})}d\boldsymbol{x}~. $$

(5)

Using Eq. 4, the 3D Zernike moments $\Omega _{nl}^{m}$ of an object can be written as a linear combination of geometric moments of order up to n

$$ \Omega_{nl}^{m}= \frac{3}{4\pi}\cdot\sum_{r+s+t\leq n}\overline{\chi_{nlm}^{rst}}\cdot M_{rst}~, $$

(6)

where M_rst is the geometric moment of the object scaled to fit in the unit ball

$$ M_{rst} = \int_{\lvert\boldsymbol{x} \rvert \leq 1} f(\boldsymbol{x}) \cdot x^{r} y^{s} z^{t} d\boldsymbol{x} \enspace, $$

(7)

where $\boldsymbol {x} \in \mathbb {R}^{3}$ is the vector $\boldsymbol {x} = \left (x, y, z\right)^{\intercal }$.

The 3D Zernike moments $\Omega _{nl}^{m}$ are not invariant under rotations. In order to achieve invariance, moments are collected into (2l+1)-dimensional vectors $\boldsymbol {\Omega }_{nl}=\left (\Omega _{nl}^{l}, \Omega _{nl}^{l-1}, \Omega _{nl}^{l-2}, \dots, \Omega _{nl}^{-l}\right)^{\intercal }$, and the rotationally invariant 3D Zernike descriptors F_nl are defined as norms of vectors Ω_nl:

$$ F_{nl} := \left\lVert\boldsymbol{\Omega}_{nl}\right\rVert~. $$

(8)

Given the maximum moment order N, the number of 3D Zernike descriptors can be easily determined by using the following formula:

$$ \text{No. 3DZDs}=\left\{ \begin{array}{ll} \left(\frac{N+2}{2} \right)^{2}, & \text{if \textit{N} is even}\\ \frac{\left(N+1\right)\left(N+3\right)}{4}, & \text{if \textit{N} is odd}~. \end{array} \right. $$

(9)

Patch representation using 3D Zernike descriptors

The physico-chemical and biochemical properties described in the HQI8 amino acid index set are mapped on the voxelised representation of the protein’s SES. Depending on the amino acid it belongs to, each atom in the protein is assigned the corresponding numeric values of the properties scaled by the atom’s radius. For a given amino acid index, each voxel in the protein’s SES is assigned the corresponding value of the atom occupying that voxel. If a voxel belongs to two or more atoms (i.e. if two or more atoms overlap), then the sum of the corresponding values of the overlapping atoms is assigned to that voxel. If a voxel does not belong to the SES of the current protein, its value is set to zero.

Eight 3D functions are thus defined, each describing one of the properties of the HQI8 set. For a given protein P, these functions are formally defined as follows. Let A_P be the set of atoms in the current protein P, and let $\Phi _{i}: A_{P} \rightarrow \mathbb {R}$ the function which assigns to each atom the numeric value of the corresponding amino acid for a given amino acid index i∈HQI8. Then, for a given amino acid index i∈HQI8, the corresponding property is mapped on the SES(P) according to the following 3D function:

(10)

where r_a is the radius of atom a, and is the indicator function for atom a defined as:

(11)

Zernike descriptors cannot be used to distinguish positive valued functions from negative valued ones (see the Additional file 1 for a concise mathematical justification). For instance, a surface patch with a certain charge distribution pattern would be indistinguishable from another patch with the same shape and inverted electrostatic charges in terms of 3DZDs. This can be avoided by considering a 3D function f(x) as the difference of its positive part f⁺(x)= max(f(x),0) with its negative part f⁻(x)=− min(f(x),0), i.e. f(x)=f⁺(x)−f⁻(x), and by computing the 3DZDs of these two functions separately.

Three of the amino acid indices in HQI8 can assume both positive and negative values, namely BLAM930101, BIOV880101 and MIYS990104, while the remaining five indices assume positive values only. The positive and negative parts were considered separately for these three indices, yielding a total of 11 3DZDs describing the HQI8 properties for each local surface patch. The maximal order 20 was used for the calculation of the 3DZDs, thus, according to Eq. 9, each patch is characterised with a total of 11×121=1331 features.

Support vector machine

Support vector machine (SVM) is a binary classification technique introduced by Vapnik et al. [84–86]. While traditional binary classification methods generally minimize the empirical training error, SVM minimizes the upper bound of the generalization error by maximizing the margin between the separating hyperplane and the data, abiding to the structure risk minimization principle for model selection. Striking feature of SVM is the property of compacting information contained in the training data, and providing a sparse representation even when using a small number of data points.

A binary classification problem usually involves separating data into training and test sets. The instances (samples) of the training set are the pairs (x_i,y_i), where x_i is a vector representing the features or attributes of the given sample and y_i∈{−1,+1} is the corresponding class label. The goal of SVM is to produce a model based on the training data which predicts the class labels of the test data given only the feature vectors of the test data. This is achieved by solving the following optimisation problem:

$$ \begin{aligned} \min_{\boldsymbol{w}, b, \boldsymbol{\xi}} \enspace & \frac{1}{2} \boldsymbol{w}^{\intercal} \boldsymbol{w} + C\sum_{i=1}^{l} \xi_{i} \\ \text{subject to} \enspace & y_{i}\left(\boldsymbol{w}^{\intercal} \phi(\boldsymbol{x}_{i}) + b \right) \geq 1 - \xi_{i}, \\ & \xi_{i} \geq 0, i=1, \dots, l ~, \end{aligned} $$

(12)

where ϕ(x_i) maps x_i into a higher-dimensional (and potentially even an infinite-dimensional) space, and C>0 is the penalty parameter of the error term. In practice the dual formulation of this problem is solved instead, due to high dimensionality of the vector variable w:

$$ \begin{aligned} \min_{\boldsymbol{\alpha}} \enspace & \frac{1}{2} \boldsymbol{\alpha}^{\intercal} y_{i} y_{j} \phi(\boldsymbol{x}_{i})^{\intercal} \phi(\boldsymbol{x}_{j}) \boldsymbol{\alpha} -\boldsymbol{e}^{\intercal} \boldsymbol{\alpha} \\ \text{subject to} \enspace & \boldsymbol{y}^{\intercal} \boldsymbol{\alpha} = 0, \\ & 0 \leq \alpha_{i} \leq C, i=1, \dots, l \enspace, \end{aligned} $$

(13)

where $\boldsymbol {e} = \left [1, 1, \dots, 1 \right ]^{\intercal }$ is the vector of all ones.

After solving the dual problem, the optimal w is given by

$$ \boldsymbol{w}=\sum\limits_{i=1}^{l} y_{i} \alpha_{i}\phi(\boldsymbol{x}_{i}) \enspace, $$

(14)

and by setting $K(\boldsymbol {x}_{i}, \boldsymbol {x}_{j}) = \phi (\boldsymbol {x}_{i})^{\intercal } \phi (\boldsymbol {x}_{j})$, the decision function is given by:

$$ \begin{aligned} f(\boldsymbol{x}) &= \text{sgn} \left(\boldsymbol{w}^{\intercal} \phi(\boldsymbol{x}) + b \right) \\ &= \text{sgn} \left(\sum\limits_{i=1}^{l}\ y_{i} \alpha_{i} K(\boldsymbol{x}_{i}, \boldsymbol{x}) + b \right)~. \end{aligned} $$

(15)

Please note that there is no need to compute the mapped feature vectors ϕ(x) explicitly. Instead, only the dot products between mapped feature vectors are calculated $K(\boldsymbol {x}_{i}, \boldsymbol {x}_{j}) = \phi (\boldsymbol {x}_{i})^{\intercal }\phi (\boldsymbol {x}_{j})$. K(x_i,x_j) is also known as kernel function.

SVM can perform non-linear classification in the feature space by finding a separating hyperplane with maximal margin in the higher dimensional space generated by ϕ(·). This is easily done by using different kernel functions generating ϕ(·). The most used kernels are given in Table 2. Although the performance of SVM mostly depends on the choice of an appropriate kernel function, there is no optimal way to choose an optimal kernel function within a data-driven approach.

Table 2 The four basic kernel functions

Full size table

In this work, interface local patch descriptors are labelled as positive samples (+1) and non-interface ones are labelled as negative samples (−1). Therefore, our interface recognition problem is actually a binary classification problem which can be handled by a SVM. In this work we used the SVM implementation provided in the scikit-learn Python module for machine learning version 0.18.1 [87].

Performance measures

The PPI interface prediction based on local surface patch descriptors is a binary classification problem, thus, a number of commonly used measures can be employed to evaluate the performance. These methods include accuracy (A), precision (P), recall (R), F₁ score (F₁) and the Matthews correlation coefficient (MCC) (see Table 3).

Table 3 Performance measures for the binary classification problem: TP – true positives, TN – true negatives, FP – false positives, FN – false negatives

Full size table

The Receiver Operating Characteristic (ROC) and the Precision–Recall (PR) curve plots and their Area Under the Curve (AUC) can also be used to assess the quality of a binary classifier. The ROC curve is the most commonly used way to visualize the performance of a binary classifier, and AUC is a very good way to summarize its performance in a single number. In this work, the ROC curve of an SVM classifier is created by plotting the True Positive Rate (the fraction of true positives out of the total predicted positives) against the False Positive Rate (the fraction of false positives out of the total predicted negatives), at various threshold values of the intercept term b in Eq. 15. The PR curve is obtained by plotting the precision values against the corresponding recall for all threshold values of b.

Dataset

The Protein–Protein Docking Benchmark 5.0 (DB5) [88] was used as dataset in this work. The benchmark consist of 230 non-redundant, high quality structures of protein–protein complexes along with the unbound structures of their components. Non-redundancy is set at the family level of SCOPe 2.03 [89]: two complexes were considered redundant when the pairs of interacting domains were the same at the SCOPe family level. Antibody–antigen complexes were considered redundant only when the SCOP families of the antigens were identical, and at least 80% of the antigen interface residues were shared between the two complexes. The complexes are divided into 8 different classes: (1) Antibody–Antigen (A), (2) Antigen–Bound Antibody (AB), (3) Enzyme–Inhibitor (EI), (4) Enzyme–Substrate (ES), (5) Enzyme complex with a regulatory or accessory chain (ER), (6) Others, G-protein containing (OG), (7) Others, Receptor containing (OR), and (8) Others, miscellaneous (OX). The complexes are further classified based on the conformational changes upon binding into three classes: (1) rigid-body, (2) medium difficulty and (3) difficult.

In order to assess the predictive capabilities of the proposed method on different protein complex classes, we considered the 8 different classes in the DB5 separately. For each class, we also separated the receptor proteins from the ligand ones, thus obtaining 16 separate datasets. We maintained the separation between classes A and AB, although not being biologically different, in order to be able to evaluate the performance variations due to conformational changes upon binding, as there are no unbound structures available for the receptor proteins in the AB class. For each of the 16 datasets, we further reduced redundancy to a maximum of 90% sequence identity between pairs of different (unbound) proteins with the CD-HIT tool [90, 91]. Each dataset was then randomly split into two disjoint sets: a training set of approximately 60% of the number of complexes and a test set of the remaining ∼ 40% (see Table 4).

Table 4 Training and test split for each of the 16 protein classes in the Protein–Protein Docking Benchmark 5.0

Full size table

The interaction interface generally corresponds to a small portion of a protein’s surface, thus, a uniform sampling of the protein surface into local surface patches results in a highly-imbalanced classification problem where the interface patches are the minority class. Most machine learning algorithms do not perform well when the number of instances of one class far exceeds the other, especially when classification accuracy is employed as a figure of merit. This can lead to classifiers that tend to label all the samples as belonging to the majority class, thus trivially obtaining a high accuracy measure.

In this work we used a combination of undersampling of the majority class and oversampling of the minority class in order to balance the training set. The surface of each protein in the training set was first sampled into local surface patches with a minimum separation of 4.5Å between patch centres. Then, only the interface regions were sampled with a minimum separation of 1.0Å between patch centres. This procedure yields more balanced training sets (see Table 5) and guarantees that both the interface and non-interface protein surface regions are sampled in a fairly uniform fashion. We also used the F₁ score (instead of classification accuracy) as a figure of merit during model evaluation on the training samples. The test samples, on the other hand, were obtained by uniformly sampling the surfaces of the proteins in the test set with a minimum separation of 1.8Å between patch centres, thus retaining the original distribution of positive and negative samples. Table 5 also reports the unbalanced version of the training set obtained with the same parameters.

Table 5 The number of interface (positive samples) and non-interface (negative samples) local surface patches in the balanced and unbalanced versions of the training set and in the test set for each protein class

Full size table

SVM model selection

Choosing an appropriate kernel function with the corresponding best hyper-parameters (which include the penalty C and the kernel parameters) is critical for achieving good classification performance with SVMs. Although grid-search is currently the most widely used method for hyper-parameter optimisation in learning algorithms, it can be prohibitively time-consuming since not all hyper-parameters are equally important to tune. Grid search experiments might end up allocating too many trials to the exploration of dimensions with low impact on the final performance and suffer from poor coverage of the more important ones. On the other hand, randomised search experiments were recently proven more efficient in several learning algorithms and datasets [92], and have thus been gaining popularity in several applications.

Feature selection was also performed (on the training samples only) in order to reduce the number of features to a subset of relevant ones, since its benefits are manifold (model simplification, shorter training times, better generalisation and avoiding curse of dimensionality) [93]. In this work, we employed a relatively novel feature selection procedure know as Randomized Logistic Regression [94]. This method works by sub-sampling the training data and fitting a L1-regularised Logistic Regression model where the penalty of a random subset of coefficients has been scaled. By performing this double randomization several times, the method assigns high scores to features that are repeatedly selected across randomizations (see the Additional file 1 for a more detailed description of the feature selection algorithm).

After the feature selection, we performed a randomized search over the hyper-parameters for each of the kernel functions described in Table 2: each parameter was sampled from either a distribution over possible values or a list of discrete choices. The penalty parameter C was sampled from the continuous exponential distribution with mean 2000 for all kernel functions. The γ parameter was sampled from the continuous exponential distribution with mean 0.01 for the polynomial, RBF and sigmoid kernel functions. The degree d parameter of the polynomial kernel was sampled from the discrete uniform distribution $\mathcal {U}\lbrace 2, 10 \rbrace $ (the polynomial kernel of degree 1 is actually the linear kernel), while the r parameter of the polynomial and sigmoid kernels was sampled from the continuous uniform distribution $\mathcal {U}\left (-2, 2 \right)$. The computation budget, i.e. the total number of sampled candidates or sampling iterations, was set to 200 iterations for each kernel function.

The hyper-parameter evaluation was carried out through leave-one-out cross-validation (LOOCV) at the protein level. If the training set consists of k proteins, in turn, each protein is removed from the training set, and a model is trained on the samples of the remaining k−1 proteins. The resulting model is then validated on the samples of the protein that was left out. The performance measure reported by LOOCV is then the average of the values computed in the loop. We used the F₁ score as a performance measure throughout all experiments.

Interface residue prediction

In order to predict the set of interface residues in a target protein the predicted interface surface patches must be mapped on the underlying residues. The mapping procedure can be summarized as follows. Each residue in the query protein is assigned an initial score of 0. Then, for each predicted interface surface patch we identify the set of its underlying residues, that is, all the residues with at least one atom within a 6Å distance from the patch centre. The score of each underlying residue is incremented by 1/(1+d), where d is the minimum distance from its atoms to the current patch centre. At the end of the procedure, each residue in the query protein will be assigned a score which indicates its likelihood of belonging to the PPI interface. Each residue can then be classified as interacting or non-interacting by thresholding on this score.

Results and discussion

Model selection results

Table 6 summarises the results of the feature selection procedure with the Randomized Logistic Regression algorithm, describing the number of selected features for each amino acid index, while Table 7 describes the best model chosen by the Randomized Search with leave-one-out cross-validation procedure (see Additional file 2 for the indices of the selected features for each protein class). A relatively small portion of the overall number of features (1331) are extracted for each protein class. This is probably due to the fact that we are mapping residue-wise properties on the molecular surface, and the resulting patterns that arise on the local surface patches are relatively simple. This means that only a few terms of the 3DZDs of order 20 (121-dimensional vectors) are required in order to capture such patterns, and thus distinguish between interface and non-interface surface patches.

Table 6 The number of selected features belonging to each physico-chemical property and for each protein class. The + and − signs indicate, respectively, the descriptors of the positive and negative parts of the corresponding amino acid index

Full size table

Table 7 The selected (best) SVM model for each protein class, i.e. the penalty C, the kernel function and its parameters (γ, d, r)

Full size table

It is also worth noticing that the number of selected features of a given amino acid index varies from one protein class to another. For instance, 24 features are selected for the NAKH920108 amino acid index property for the bound version of protein class A _r, while, for the bound version of protein class ES _l the algorithm selects no features at all for the same amino acid index property. This is consistent with the hypothesis that interfaces of proteins belonging to different classes and carrying different functions can vary widely. Moreover, the number of selected features for a given amino acid index can be used to measure its importance in the characterisation of the PPI interface in a given protein class. The Randomized Logistic Regression algorithm only selects important features which correlate with the classification labels: if few features of an amino acid index property are selected in a given protein class, it means that the given property is not important in discriminating interface from non-interface surface regions for the current class. On the other hand, since the classification labels depend on the selected features, key properties which drive protein interactions in the current protein class will have many of their features selected by the algorithm.

Prediction results on the test set

The performance results for the proposed methodology at the surface patch level on the test set are presented in Table 8 (see Additional file 3 for the prediction results on the test set for each protein). Figure 1 describes the Receiver Operating Characteristic curve for each protein class. The performance of the proposed methodology varies widely from one protein class to another: from a very high AUC-ROC of 94% for class A _r (95.4% in the bound case) to a much less satisfactory prediction for class A _l (in both bound and unbound cases). The effect of the conformational changes proteins undergo upon binding can be observed in the differences between the obtained performance values for the bound and unbound versions of the protein classes: for most protein classes, the bound versions obtain better prediction results than the unbound ones.

Table 8 Mean and standard deviation (in parentheses) measures of F₁ score, classification accuracy, precision, recall, MCC and ROC-AUC obtained on the test set at the local surface patch level using the corresponding best SVM model

Full size table

To investigate the reasons behind the different performance rates achieved for different protein classes, we measured the average pairwise sequence identity for each protein class (see Table 9, we excluded the pairwise sequence identity measures for chains within the same protein). No particular correlation emerges between the classification performance at the patch level and the average pairwise sequence identity of the different protein classes. For instance, the average pairwise sequence identity in the unbound version of protein class A _l is 41.52%, which is higher than in some other classes. However, we achieve the lowest classification performance in this class. For this reason we conclude that the performance discrepancies are due to the varying capability of the HQI8 index to adequately represent the diverse interaction patterns that characterise PPIs in the different protein classes.

Table 9 Average pairwise sequence identity (in %) for each protein class

Full size table

In order to further demonstrate the necessity of developing class-specific protein interface predictors, we trained a generic SVM model based on all the structures in the training set, only differentiating between bound and unbound structures, and evaluated its performance on the test set structures for each protein class. The comparison of the average ROC curves of the class-specific and generic models are given in Fig. 1 for each protein class. In general, the class-specific models obtain better classification performance in terms of ROC-AUC, especially for the bound versions of protein classes A _r, AB _r, OG _r and the unbound versions of protein classes A _r, AB _r, AB _l, EI _r, ER _l, ES _r, ES _l. Interestingly enough, the class-specific and generic models both obtain very similar results in classes OG, OR and OX (except for the bound version of OG _r). These are the most generic classes in DB5 (i.e. Others, G-protein containing (OG), Others, Receptor containing (OR) and Others, miscellaneous (OX)), thus the benefits of using a class-specific training set are less evident.

Post-processing

By analysing the results in Table 8 we noticed that for some protein classes the prediction performance in terms of ROC-AUC and recall was high while the other prediction metrics were low. This is due to the fact that the default threshold (t=0) used by the SVM classifier (on the b term in Eq. 15) does not yield optimal binary classification results, since the employed training set is balanced and does not reflect the natural distribution of interface and non-interface patches. We selected the best threshold value that maximises the average F₁ score on the training set proteins for each protein class: we used the unbalanced version of the training set for each protein class for this task. The best SVM threshold values obtained for each protein class are reported in the Additional file 1.

Interface regions usually consist of continuous portions of the protein surface. For this reason, the spatial relations among the predicted interface patch centres can be exploited in order to reduce the number of false positive local surface patches. This can be achieved by retaining predicted surface patches which form continuous clusters on the protein surface while discarding the spatially isolated ones. The Isolation Forest (IF) algorithm for outlier detection [95] was used to reduce the number of spatially-isolated false positive local surface patches. Interface regions are composed of contiguous surface patches, thus isolated patches marked as positive by the SVM classifier can be safely discarded. For each query protein, an IF classifier is trained on the coordinates of the LSPs identified as interface patches by the SVM classifier, using their distances from the separating hyperplane as weights. Then, the IF classifier is used on the whole set of surface patches of the query protein to identify the ones belonging to the PPI interface. A contamination parameter must be provided to the IF algorithm: we identified the optimal parameter values for each protein class by testing all contamination values from 0.00 to 0.5 with a constant increment of 0.01, and selected the ones that yielded the best average F₁ score on the training set of the corresponding protein class. Because the IF for outlier detection is a random algorithm, the F₁ score was averaged over 100 runs for each contamination value. When the best average F₁ score was obtained for a contamination value equal to zero we skipped the IF step. The best contamination values obtained for each protein class are reported in the Additional file 1.

Comparison with other methods

Homology-based (or template-based) approaches constitute the best performing PPI interface prediction methods to date (given the availability of close homologous structures) [16]. These methods infer the biological properties of a query protein from its homologs based on the assumption that homologs share significant similarity in sequence, structure and functional sites. For this reason, in order to assess the prediction capabilities of the proposed methodology, we compared it with two state-of-the-art homology-based PPI interface prediction algorithms: NPS-HomPPI [96] and PrISE [97], and with the well-known structure-based approach SPPIDER [30, 74]. NPS-HomPPI infers interfacial residues for a query protein from the interfacial residues of its homologs. Based on interface conservation thresholds derived from a systematic interface conservation analysis, NPS-HomPPI classifies the templates into either Safe, Twilight or Dark Zone, and uses multiple templates from the best available zone to infer interfaces for query proteins. PrISE is a family of local structural similarity-based computational methods for predicting PPI interface residues. For each target residue in a query protein structure, the spatial neighbours of the target are extracted and represented by their atomic composition and accessible surface areas. PrISE then searches its pre-calculated database for similar structural elements with experimentally determined interface information, and weights them according to their similarity with the structural element of the query protein. SPPIDER is a consensus method that combines the output of 10 Neural Networks using the majority voting. It uses the difference between the predicted and the actual rASA in an unbound structure of a residue as a feature (fingerprint) to predict interfaces.

The assessment was carried out on the structures of the test set described in Table 4, and the performance evaluation was done separately for each protein class. We used the following common definition of the PPI interface for all methods: a residue is considered as interfacial if at least one of its heavy atoms is within a 5Å distance from any other heavy atom of the interacting protein. When possible (i.e. for NPS-HomPPI and PrISE), the interface definition parameter was set accordingly. In the homology-based methods, all homologous structures with sequence identity with the query protein of 90% or above were not considered. We also required the predictions to be expressed as scores or probabilities estimating the likelihood of a residue being in the interface. The default settings were used for all the remaining parameters.

By thresholding on the residue score values we computed the average ROC curves and average Precision-Recall curves for each method which are shown in Figs. 2 and 3 respectively. The proposed methodology outperforms the competitor predictors in both the bound and unbound versions of protein classes A _r and AB _r: the ROC-AUC and PR-AUC values obtained by our predictor are significantly higher than the others. In the unbound version of protein class A _r, our method achieves a ROC-AUC of 94.2% and a PR-AUC of 67.7% while, for the competitors, the maximum ROC-AUC is 78.0% (for NPS-HomPPI) and the maximum PR-AUC is 12.4% (for SPPIDER). Similarly, in the bound version of protein class A _r, our method achieves a ROC-AUC of 95.4% and a PR-AUC of 56.4% while, for the competitors, the maximum ROC-AUC is 79.6% (for NPS-HomPPI) and the maximum PR-AUC is 12.2% (for SPPIDER). In the unbound version of AB _r, the proposed method achieves a ROC-AUC of 84.0% and a PR-AUC of 39.2% while the maximum ROC-AUC for the competitors is 78.9% (for PrISE) and the maximum PR-AUC is 13.5% (for SPPIDER). For the bound version of AB _r our method obtains a ROC-AUC score of 81.3% and a PR-AUC of 33.5%. The best ROC-AUC obtained by the competitors in the same class is 77.6% (for PrISE) and the best PR-AUC is 14.6% (for SPPIDER). Noticeably better prediction performance is also achieved in the unbound and bound versions of class EI _r: the achieved ROC-AUC values are 74.4% for the unbound and 75.5% for the bound version, while the achieved PR-AUC values are 33.2% for the unbound and 34.2% for the bound version. Although PrISE obtained the same ROC-AUC in the bound version of EI _r, the corresponding PR-AUC is only 27.0%. Slightly better than average prediction results were also obtained in the unbound versions of classes AB _l, OG _l, OX _l and in the bound version of class EI _l. Our prediction method underperformed compared to the competitors in the unbound versions of classes ES _r, ER _l and in the bound versions of classes OR _r, OR _l, OG _l. In all other protein classes the prediction capabilities of the proposed methodology followed the average trend of the competitor methods.

The obtained results agree with the initial hypothesis that proteins belonging to different classes exhibit diverse interaction mechanisms. To this end, the choice of a correct set of physico-chemical and biochemical properties characterising the interaction site is crucial, although it might not be possible to identify a comprehensive set of features that works well for all protein classes as the recognition patterns can be very different. Our results suggest that, although a given set of features can effectively discriminate between interface and non-interface surface regions for a given protein class, it can perform very poorly when used on other protein classes. Interface prediction could be further improved by better feature representation and selection methods that can effectively capture complex protein recognition patterns in diverse types of interactions, however, different protein classes should be treated separately.

The HQI8 amino acid index set of physico-chemical and biochemical properties showed very good discriminative capabilities for the interface recognition of some protein classes (A _r, AB _r) while under-performing in others (A _l). Other sets of properties could exhibit better discriminative power in the protein classes where we obtained low prediction performances. Ideally, an optimal set of features could be selected for each protein class in order to correctly identify the class-specific PPI patterns. The proposed methodology can be easily extended to other sets of amino acid properties, which can be similarly mapped on the voxelised protein surface and represented by 3DZDs.

Although not considered in this work, binding partner specificity has recently been reported to greatly affect the quality of predicted PPI interfaces [16], especially in transient protein interactions [96]. Partner-specific interface prediction methods have been shown to outperform several state-of-the-art non-partner-specific ones [22, 34, 98], and apparently, specific interacting partners should be considered in order to reliably predict interface regions. The methodology introduced in this work could be extended in order to predict pairs of interacting local surface patches by feeding the SVM classifier with the concatenation of the corresponding descriptors. However, class imbalance should be handled very carefully, as the number of negative samples (non-interacting patch pairs) would significantly increase with respect to the non-parter-specific case, while the number of positive samples (interacting patch pairs) would roughly remain the same as in the previous case.

Conclusions

Existing structure-based PPI interface predictors employ 3D structural information to encode statistical properties of surface patches as input feature vectors for binary classifiers while information about the spatial arrangement of atoms and residues is usually ignored. In this study we introduced a novel method for the prediction of PPI interface regions based on 3D Zernike descriptors, HQI8 amino acid index set and SVMs. We demonstrated that 3D Zernike Descriptors of physico-chemical and biochemical amino acid properties mapped on local patches of the protein surface can be used to characterise the latter in order to distinguish between interface and non-interface regions. The 3DZDs are able to capture the similarity among patterns of physico-chemical and biochemical properties mapped on the protein surface arising from the various spatial arrangements of the underlying residues. It is also worth noticing that this is the first time the physico-chemical and biochemical properties of the HQI8 set were mapped directly onto the 3D representation of the protein surface instead of being used to characterise the protein sequence.

This method was tested on 16 protein classes extracted from the Protein–Protein Docking Benchmark 5.0, on both the bound and unbound versions, and was compared with three other state-of-the-art PPI interface predictors, namely SPPIDER, PrISE and NPS-HomPPI. With a resulting ROC-AUC of, respectively, over 94% and 81%, we obtained very good classification results on protein classes A _r and AB _r (i.e. antibodies) and even outperformed the competitors also in terms of precision–recall. These results are very encouraging, thus we are planning to develop a specific antigen-binding interface (also known as paratope) prediction method for antibodies with known structure using the 3D Zernike descriptors and the HQI8 amino acid index set. The field of paratope residue prediction appears to be somewhat underdeveloped, with a general paucity of specific predictors, thus any future development in this direction should be quite promising.

Our results show that the choice of a proper set of features characterising the protein interface is crucial for the interface prediction task, and that the optimal set of features strongly depends on the specific protein class. For further improvement of prediction performance, it is necessary to identify an optimal set of features for each protein class or interaction type. As a future development, we plan to test several sets of features on different protein classes in order to widen the predictive capabilities of the proposed method. Including informations regarding possible binding partners in the prediction procedure is also expected to increase the overall performance, although tackling the resulting class imbalance will not be trivial.

The comparison of the class-specific interface prediction models with a generic one, trained on all training samples regardless of the protein class, also confirmed the hypothesis that interface prediction model development should be carried separately for different protein classes. In a certain way, this is similar to the homology-based interface prediction approach: for a given query protein, its closest homolog proteins with known binding sites are retrieved, and the query’s binding site is determined by comparison with the known structures. However, these methods cannot yield good predictions when adequate homologs are not available. Similarly to homology-based predictors, the proposed method requires the availability of several protein classes in order to reliably predict interface regions. Given the ever increasing number of available high-resolution 3D protein structures in public repositories, we expect that more benchmark sets and databases such as the Protein–Protein Docking Benchmark 5.0 which classify proteins into biologically relevant classes will be available. As a future work, we will expand the proposed methodology to other protein classes in order to increase the coverage of its predicting capabilities to as many proteins as possible. A pre-processing step for the determination of the most adequate protein class will be required for proteins with unknown classes. The predictor will then be made available to users throughout a dedicated web server.

The majority of the available protein interface identification methods make predictions at the residue resolution level. Protein–protein docking algorithms, however, require high-resolution atomic level knowledge in order to correctly predict native binding configurations between interacting proteins. The predictions at the local surface patch level can be readily used to guide protein–protein docking methods by limiting the docking search space to the sole surface patches which were predicted as belonging to the interaction interface. Docking algorithms based on local surface descriptor matching can greatly benefit from the proposed approach since this will sensibly limit the number of candidate patch pairs to be evaluated, thus reducing the conformational search space, and consequently reducing both the number of false positives and the required calculation time.

References

Berggård T, Linse S, James P. Methods for the detection and analysis of protein–protein interactions. Proteomics. 2007; 7(16):2833–42.
Article PubMed CAS Google Scholar
Keskin O, Tuncbag N, Gursoy A. Predicting protein–protein interactions from the molecular to the proteome level. Chem Rev. 2016; 116(8):4884–909.
Article CAS PubMed Google Scholar
Xu W, Weissmiller AM, White JA, Fang F, Wang X, Wu Y, Pearn ML, Zhao X, Sawa M, Chen S, et al.Amyloid precursor protein–mediated endocytic pathway disruption induces axonal dysfunction and neurodegeneration. J Clin Investig. 2016; 126(5):1815–33.
Article PubMed PubMed Central Google Scholar
Liyasova MS, Ma K, Lipkowitz S. Molecular pathways: Cbl proteins in tumorigenesis and antitumor immunity–opportunities for cancer treatment. Clin Cancer Res. 2015; 21(8):1789–94.
Article CAS PubMed Google Scholar
Rask-Andersen M, Almén MS, Schiöth HB. Trends in the exploitation of novel drug targets. Nat Rev Drug Discov. 2011; 10(8):579–90.
Article CAS PubMed Google Scholar
Li B, Kihara D. Protein docking prediction using predicted protein–protein interface. BMC Bioinformatics. 2012; 13(1):7.
Article PubMed PubMed Central CAS Google Scholar
Xue LC, Jordan RA, Yasser EM, Dobbs D, Honavar V. DockRank: Ranking docked conformations using partner-specific sequence homology-based protein interface prediction. Proteins Struct Funct Bioinformatics. 2014; 82(2):250–67.
Article CAS Google Scholar
Xue LC, Rodrigues JP, Dobbs D, Honavar V, Bonvin AM. Template-based protein–protein docking exploiting pairwise interfacial residue restraints. Brief Bioinform. 2017; 18(3):458–66.
PubMed Google Scholar
Kobe B, Guncar G, Buchholz R, Huber T, Maco B, Cowieson N, Martin JL, Marfori M, Forwood JK. Crystallography and protein–protein interactions: biological interfaces and crystal contacts.London: Portland Press Limited; 2008.
Google Scholar
Shi Y. A glimpse of structural biology through X-ray crystallography. Cell. 2014; 159(5):995–1014.
Article CAS PubMed Google Scholar
O’Connell MR, Gamsjaeger R, Mackay JP. The structural analysis of protein–protein interactions by NMR spectroscopy. Proteomics. 2009; 9(23):5224–32.
Article PubMed CAS Google Scholar
Callaway E. The revolution will not be crystallized: a new method sweeps through structural biology. Nature. 2015; 525(7568):172. https://doi.org/10.1038/525172a.
Article CAS PubMed Google Scholar
Simões IC, Costa IP, Coimbra JT, Ramos MJ, Fernandes PA. New parameters for higher accuracy in the computation of binding free energy differences upon Alanine Scanning Mutagenesis on protein–protein interfaces. J Chem Inf Model. 2016; 57(1):60–72.
Article PubMed CAS Google Scholar
Li J, Wei H, Krystek Jr SR, Bond D, Brender TM, Cohen D, Feiner J, Hamacher N, Harshman J, Huang R, et al.Mapping the Energetic Epitope of an Antibody/Interleukin-23 Interaction with Hydrogen/Deuterium Exchange, Fast Photochemical Oxidation of Proteins Mass Spectrometry, and Alanine Shave Mutagenesis. Anal Chem. 2017; 89(4):2250.
Article CAS PubMed Google Scholar
Schweppe DK, Chavez JD, Lee CF, Caudal A, Kruse SE, Stuppard R, Marcinek DJ, Shadel GS, Tian R, Bruce JE. Mitochondrial protein interactome elucidated by chemical cross-linking mass spectrometry. Proc Natl Acad Sci. 2017; 114(7):1732–7.
Article CAS PubMed PubMed Central Google Scholar
Xue LC, Dobbs D, Bonvin AM, Honavar V. Computational prediction of protein interfaces: A review of data driven methods. FEBS Lett. 2015; 589(23):3516–26.
Article CAS PubMed PubMed Central Google Scholar
Maheshwari S, Brylinski M. Predicting protein interface residues using easily accessible on-line resources. Brief Bioinform. 2015; 16(6):1025–34.
Article PubMed Google Scholar
Esmaielbeiki R, Krawczyk K, Knapp B, Nebel JC, Deane CM. Progress and challenges in predicting protein interfaces. Brief Bioinform. 2016; 17(1):117–31.
Article PubMed Google Scholar
Yan C, Dobbs D, Honavar V. A two-stage classifier for identification of protein–protein interface residues. Bioinformatics. 2004; 20(suppl 1):371–8.
Article CAS Google Scholar
Šikić M, Tomić S, Vlahoviček K. Prediction of protein–protein interaction sites in sequences and 3D structures by random forests. PLoS Comput Biol. 2009; 5(1):1000278.
Article CAS Google Scholar
Murakami Y, Mizuguchi K. Applying the Naïve Bayes classifier with kernel density estimation to the prediction of protein–protein interaction sites. Bioinformatics. 2010; 26(15):1841–8.
Article CAS PubMed Google Scholar
Ahmad S, Mizuguchi K. Partner-aware prediction of interacting residues in protein–protein complexes from sequence data. PLoS ONE. 2011; 6(12):29104.
Article CAS Google Scholar
Sriwastava BK, Basu S, Maulik U, Plewczynski D. PPIcons: Identification of protein–protein interaction sites in selected organisms. J Mol Model. 2013; 19(9):4059–70.
Article CAS PubMed PubMed Central Google Scholar
Chen X-w, Jeong JC. Sequence-based prediction of protein interaction sites with an integrative method. Bioinformatics. 2009; 25(5):585–91.
Article PubMed CAS Google Scholar
Garcia-Garcia J, Valls-Comamala V, Guney E, Andreu D, Muñoz FJ, Fernandez-Fuentes N, Oliva B. iFraG: A protein–protein interface prediction server based on sequence fragments. J Mol Biol. 2017; 429(3):382–9.
Article CAS PubMed Google Scholar
Berman HM, Henrick K, Nakamura H. Announcing the worldwide Protein Data Bank. Nat Struct Mol Biol. 2003; 10(12):980. https://doi.org/10.1038/nsb1203-980.
Article CAS Google Scholar
Desiere F, Deutsch EW, King NL, Nesvizhskii AI, Mallick P, Eng J, Chen S, Eddes J, Loevenich SN, Aebersold R. The PeptideAtlas project. Nucleic Acids Res. 2006; 34(suppl 1):655–8. https://doi.org/10.1093/nar/gkj040.
Article CAS Google Scholar
Craig R, Cortens JP, Beavis RC. Open source system for analyzing, validating, and storing protein identification data. J Proteome Res. 2004; 3(6):1234–42. https://doi.org/10.1021/pr049882h.
Article CAS PubMed Google Scholar
Vizcaíno JA, Côté RG, Csordas A, Dianes JA, Fabregat A, Foster JM, Griss J, Alpi E, Birim M, Contell J, O’Kelly G, Schoenegger A, Ovelleiro D, Pérez-Riverol Y, Reisinger F, Ríos D, Wang R, Hermjakob H. The PRoteomics IDEntifications (PRIDE) database and associated tools: status in 2013. Nucleic Acids Res. 2013; 41(Database issue):1063–9. https://doi.org/10.1093/nar/gks1262.
Google Scholar
Porollo A, Meller J. Prediction-based fingerprints of protein–protein interactions. Proteins Struct Funct Bioinforma. 2007; 66(3):630–45.
Article CAS Google Scholar
Kufareva I, Budagyan L, Raush E, Totrov M, Abagyan R. PIER: protein interface recognition for structural proteomics. Proteins Struct Funct Bioinforma. 2007; 67(2):400–17.
Article CAS Google Scholar
Shoemaker BA, Zhang D, Thangudu RR, Tyagi M, Fong JH, Marchler-Bauer A, Bryant SH, Madej T, Panchenko AR. Inferred Biomolecular Interaction Server–a web server to analyze and predict protein interacting partners and binding sites. Nucleic Acids Res. 2009; 38:842.
Google Scholar
Zhang QC, Deng L, Fisher M, Guan J, Honig B, Petrey D. PredUs: A web server for predicting protein interfaces using structural neighbors. Nucleic Acids Res. 2011; 39(suppl 2):283–7.
Article CAS Google Scholar
Minhas A, ul Amir F, Geiss BJ, Ben-Hur A. PAIRpred: Partner-specific prediction of interacting residues from sequence and structure. Proteins Struct Funct Bioinforma. 2014; 82(7):1142–55.
Article CAS Google Scholar
Neuvirth H, Raz R, Schreiber G. ProMate: a structure based prediction program to identify the location of protein–protein binding sites. J Mol Biol. 2004; 338(1):181–99.
Article CAS PubMed Google Scholar
Melo R, Fieldhouse R, Melo A, Correia JD, Cordeiro MND, Gümüş ZH, Costa J, Bonvin AM, Moreira IS. A machine learning approach for hot-spot detection at protein–protein interfaces. Int J Mol Sci. 2016; 17(8):1215.
Article PubMed Central CAS Google Scholar
Zinzalla G, Thurston DE. Targeting protein–protein interactions for therapeutic intervention: a challenge for the future. Future Med Chem. 2009; 1(1):65–93. https://doi.org/10.4155/fmc.09.12.
Article CAS PubMed Google Scholar
Ma B, Elkayam T, Wolfson H, Nussinov R. Protein–protein interactions: structurally conserved residues distinguish between binding sites and exposed protein surfaces. Proc Natl Acad Sci. 2003; 100(10):5772–7.
Article CAS PubMed PubMed Central Google Scholar
Yan C, Wu F, Jernigan RL, Dobbs D, Honavar V. Characterization of protein–protein interfaces. Protein J. 2008; 27(1):59–70.
Article CAS PubMed PubMed Central Google Scholar
Keskin O, Gursoy A, Ma B, Nussinov R. Principles of protein–protein interactions: What are the preferred ways for proteins to interact?Chem Rev. 2008; 108(4):1225–44.
Article CAS PubMed Google Scholar
Haspel N, Jagodzinski F. Methods for Detecting Critical Residues in Proteins In: Reeves A, editor. In Vitro Mutagenesis, Methods in Molecular Biology, vol. 1498. New York: Humana Press: 2017. p. 227–42. https://doi.org/10.1007/978-1-4939-6472-7_15.
Google Scholar
Crowley PB, Golovin A. Cation– π interactions in protein–protein interfaces. Proteins Struct Funct Bioinforma. 2005; 59(2):231–9.
Article CAS Google Scholar
Ponstingl H, Kabir T, Gorse D, Thornton JM. Morphological aspects of oligomeric protein structures. Prog Biophys Mol Biol. 2005; 89(1):9–35.
Article CAS PubMed Google Scholar
Bahadur RP, Chakrabarti P, Rodier F, Janin J. Dissecting subunit interfaces in homodimeric proteins. Proteins Struct Funct Bioinforma. 2003; 53(3):708–19.
Article CAS Google Scholar
Ozbabacan SEA, Engin HB, Gursoy A, Keskin O. Transient protein–protein interactions. Protein Eng Design Select. 2011; 24(9):635–48.
Article CAS Google Scholar
Mintseris J, Weng Z. Structure, function, and evolution of transient and obligate protein–protein interactions. Proc Natl Acad Sci USA. 2005; 102(31):10930–5.
Article CAS PubMed PubMed Central Google Scholar
Nooren IM, Thornton JM. Diversity of protein–protein interactions. EMBO J. 2003; 22(14):3486–92.
Article CAS PubMed PubMed Central Google Scholar
De S, Krishnadev O, Srinivasan N, Rekha N. Interaction preferences across protein–protein interfaces of obligatory and non-obligatory components are different. BMC Struct Biol. 2005; 5(1):15.
Article PubMed PubMed Central CAS Google Scholar
Jones S, Thornton JM. Principles of protein–protein interactions. Proc Natl Acad Sci. 1996; 93(1):13–20.
Article CAS PubMed PubMed Central Google Scholar
Cho K-I, Lee K, Lee KH, Kim D, Lee D. Specificity of molecular interactions in transient protein–protein interaction interfaces. Proteins Struct Funct Bioinforma. 2006; 65(3):593–606.
Article CAS Google Scholar
Ofran Y, Rost B. Analysing six types of protein–protein interfaces. J Mol Biol. 2003; 325(2):377–87.
Article CAS PubMed Google Scholar
Krawczyk K, Baker T, Shi J, Deane CM. Antibody i-Patch prediction of the antibody binding site improves rigid local antibody–antigen docking. Protein Eng Des Select. 2013; 26(10):621–9.
Article CAS Google Scholar
Gao YF, Li BQ, Cai YD, Feng KY, Li ZD, Jiang Y. Prediction of active sites of enzymes by maximum relevance minimum redundancy (mRMR) feature selection. Mol BioSyst. 2013; 9(1):61–9.
Article CAS PubMed Google Scholar
Izidoro SC, de Melo-Minardi RC, Pappa GL. GASS: identifying enzyme active sites with genetic algorithms. Bioinformatics. 2015; 31(6):864–70.
Article CAS PubMed Google Scholar
Dalkas GA, Rooman M. SEPIa, a knowledge-driven algorithm for predicting conformational B-cell epitopes from the amino acid sequence. BMC Bioinformatics. 2017; 18(1):95.
Article PubMed PubMed Central Google Scholar
Jespersen MC, Peters B, Nielsen M, Marcatili P. BepiPred-2.0: improving sequence-based B-cell epitope prediction using conformational epitopes. Nucleic Acids Res. 2017; 45(W1):W24–9.
Article CAS PubMed PubMed Central Google Scholar
Kunik V, Ashkenazi S, Ofran Y. Paratome: an online tool for systematic identification of antigen-binding regions in antibodies based on sequence or structure. Nucleic Acids Res. 2012; 40(W1):521–4.
Article CAS Google Scholar
Saha I, Maulik U, Bandyopadhyay S, Plewczynski D. Fuzzy clustering of physicochemical and biochemical properties of amino acids. Amino Acids. 2012; 43(2):583–94.
Article CAS PubMed Google Scholar
Lv H, Han J, Liu J, Zheng J, Liu R, Zhong D. CarSPred: a computational tool for predicting carbonylation sites of human proteins. PloS ONE. 2014; 9(10):111478.
Article CAS Google Scholar
Sriwastava BK, Basu S, Maulik U. Protein–protein interaction site prediction in Homo sapiens and E. coli using an interaction-affinity based membership function in fuzzy SVM. J Biosci. 2015; 40(4):809–18.
Article CAS PubMed Google Scholar
Du X, Sun S, Hu C, Li X, Xia J. Prediction of protein–protein interaction sites by means of ensemble learning and weighted feature descriptor. J Biol Res Thessaloniki. 2016; 23(1):10.
Article Google Scholar
Ismail HD, Newman RH, et al.RF-Hydroxysite: a random forest based predictor for hydroxylation sites. Mol BioSyst. 2016; 12(8):2427–35.
Article CAS PubMed PubMed Central Google Scholar
Wang X, Yan R, Li J, Song J. SOHPRED: a new bioinformatics tool for the characterization and prediction of human S-sulfenylation sites. Mol BioSyst. 2016; 12(9):2849–58.
Article CAS PubMed Google Scholar
Sael L, Li B, La D, Fang Y, Ramani K, Rustamov R, Kihara D. Fast protein tertiary structure retrieval based on global surface shape similarity. Proteins Struct Funct Bioinf. 2008; 72(4):1259–73. https://doi.org/10.1002/prot.22030.
Article CAS Google Scholar
Venkatraman V, Sael L, Kihara D. Potential for protein surface shape analysis using spherical harmonics and 3D Zernike descriptors. Cell Biochem Biophys. 2009; 54(1-3):23–32.
Article CAS PubMed Google Scholar
Venkatraman V, Yang Y, Sael L, Kihara D. Protein–protein docking using region-based 3D Zernike descriptors. BMC Bioinform. 2009; 10(1):407. https://doi.org/10.1186/1471-2105-10-407.
Article CAS Google Scholar
Sael L, La D, Li B, Rustamov R, Kihara D. Rapid comparison of properties on protein surface. Proteins. 2008; 73(1):1–10. https://doi.org/10.1002/prot.22141.
Article CAS PubMed PubMed Central Google Scholar
Connolly ML. Analytical molecular surface calculation. J Appl Crystallogr. 1983; 16(5):548–58. https://doi.org/10.1107/S0021889883010985.
Article CAS Google Scholar
Daberdaku S, Ferrari C. Computing discrete fine-grained representations of protein surfaces In: Angelini C, Rancoita PM, Rovetta S, editors. Computational Intelligence Methods for Bioinformatics and Biostatistics - 12th International Meeting, CIBB 2015, Naples, Italy, September 10-12, 2015, Revised Selected Papers. Lecture Notes in Bioinformatics. Cham: Springer: 2016. p. 180–95. https://doi.org/10.1007/978-3-319-44332-4_14.
Google Scholar
Daberdaku S, Ferrari C. Computing voxelised representations of macromolecular surfaces: A parallel approach. Int J High Perform Comput Appl. 2016. https://doi.org/10.1177/1094342016647114.
Wolfson H, Nussinov R. From computer vision to protein structure and association. New Compr Biochem. 1998; 32:313–34.
Article CAS Google Scholar
Duhovny D, Nussinov R, Wolfson HJ. Efficient unbound docking of rigid molecules In: Guigó R, Gusfield D, editors. Algorithms in Bioinformatics: Second International Workshop, WABI 2002 Rome, Italy, September 17–21, 2002 Proceedings. Berlin: Springer: 2002. p. 185–200. https://doi.org/10.1007/3-540-45784-4_14.
Google Scholar
Schneidman-Duhovny D, Inbar Y, Polak V, Shatsky M, Halperin I, Benyamini H, Barzilai A, Dror O, Haspel N, Nussinov R, et al.Taking geometry to its edge: fast unbound rigid (and hinge-bent) docking. Proteins Struct Funct Bioinforma. 2003; 52(1):107–12.
Article CAS Google Scholar
Porollo A, Meller J. Computational methods for prediction of protein–protein interaction sites. Protein-Protein Interact Comput Exp Tools. 2012; 472:3–26.
Google Scholar
Kawashima S, Pokarowski P, Pokarowska M, Kolinski A, Katayama T, Kanehisa M. AAindex: amino acid index database, progress report 2008. Nucleic Acids Res. 2008; 36(suppl 1):202–5.
Google Scholar
Sael L, Kihara D. Characterization and classification of local protein surfaces using self-organizing map. Int J Knowl Discov Bioinforma. 2010; 1(1):32–47. https://doi.org/10.4018/jkdb.2010100203.
Article Google Scholar
Sael L, Kihara D. Binding ligand prediction for proteins using partial matching of local surface patches. Int J Mol Sci. 2010; 11(12):5009–26.
Article CAS PubMed PubMed Central Google Scholar
Sael L, Kihara D. Detecting local ligand-binding site similarity in nonhomologous proteins by surface patch comparison. Proteins Struct Funct Bioinforma. 2012; 80(4):1177–95.
Article CAS Google Scholar
Zhu X, Xiong Y, Kihara D. Large-scale binding ligand prediction by improved patch-based method Patch-Surfer 2.0. Bioinformatics. 2015; 31(5):707–13. https://doi.org/10.1093/bioinformatics/btu724.
Article CAS PubMed Google Scholar
Hu B, Zhu X, Monroe L, Bures MG, Kihara D. PL-PatchSurfer: a novel molecular local surface-based method for exploring protein–ligand interactions. Int J Mol Sci. 2014; 15(9):15122. https://doi.org/10.3390/ijms150915122.
Article CAS PubMed PubMed Central Google Scholar
Shin WH, Bures MG, Kihara D. PatchSurfers: Two methods for local molecular property-based binding ligand prediction. Methods. 2016; 93:41–50.
Article CAS PubMed Google Scholar
Canterakis N. 3D Zernike moments and Zernike affine invariants for 3D image analysis and recognition In: Ersbøll BK, Johansen P, editors. 11th Scandinavian Conference on Image Analysis. Kangerlussuaq: Dansk Selskab for Automatisk Genkendelse af Mønstre: 1999. p. 85–93.
Google Scholar
Novotni M, Klein R. Shape retrieval using 3D Zernike descriptors. Computer-Aided Des. 2004; 36(11):1047–62.
Article Google Scholar
Boser BE, Guyon IM, Vapnik VN. A training algorithm for Optimal Margin Classifiers. In: Proceedings of the Fifth Annual Workshop on Computational Learning Theory. COLT ’92. New York: ACM: 1992. p. 144–52. https://doi.org/10.1145/130385.130401. http://doi.acm.org/10.1145/130385.130401.
Google Scholar
Cortes C, Vapnik V. Support-vector networks. Mach Learn. 1995; 20(3):273–97.
Google Scholar
Schiolkopf B, Burges C, Vapnik V. Extracting support data for a given task. In: Proceedings, First International Conference on Knowledge Discovery & Data Mining. Menlo Park: AAAI Press: 1995. p. 252–7.
Google Scholar
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E. Scikit-learn: Machine learning in Python. J Mach Learn Res. 2011; 12:2825–30.
Google Scholar
Vreven T, Moal IH, Vangone A, Pierce BG, Kastritis PL, Torchala M, Chaleil R, Jiménez-García B, Bates PA, Fernandez-Recio J, et al.Updates to the integrated protein–protein interaction benchmarks: docking benchmark version 5 and affinity benchmark version 2. J Mol Biol. 2015; 427(19):3031–41.
Article CAS PubMed PubMed Central Google Scholar
Fox NK, Brenner SE, Chandonia JM. SCOPe: Structural Classification of Proteins—extended, integrating SCOP and ASTRAL data and classification of new structures. Nucleic Acids Res. 2014; 42(D1):304–9.
Article CAS Google Scholar
Li W, Godzik A. CD-HIT: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2006; 22(13):1658–9.
Article CAS PubMed Google Scholar
Fu L, Niu B, Zhu Z, Wu S, Li W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics. 2012; 28(23):3150–2.
Article CAS PubMed PubMed Central Google Scholar
Bergstra J, Bengio Y. Random search for hyper-parameter optimization. J Mach Learn Res. 2012; 13(Feb):281–305.
Google Scholar
James G, Witten D, Hastie T, Tibshirani R. An Introduction to Statistical Learning vol. 6. New York: Springer; 2013.
Book Google Scholar
Meinshausen N, Bühlmann P. Stability selection. J R Stat Soc Series B (Stat Methodol). 2010; 72(4):417–73.
Article Google Scholar
Liu FT, et al.Isolation-based anomaly detection. ACM Trans Knowl Discov Data. 2012; 6(1):3.
Article CAS Google Scholar
Xue LC, Dobbs D, Honavar V. HomPPI: a class of sequence homology based protein–protein interface prediction methods. BMC Bioinformatics. 2011; 12(1):244.
Article CAS PubMed PubMed Central Google Scholar
Jordan RA, Yasser EM, Dobbs D, Honavar V. Predicting protein–protein interface residues using local surface structural similarity. BMC Bioinformatics. 2012; 13(1):41.
Article CAS PubMed PubMed Central Google Scholar
Hamer R, Luo Q, Armitage JP, Reinert G, Deane CM. i-Patch: Interprotein contact prediction using local network information. Proteins Struct Funct Bioinforma. 2010; 78(13):2781–97.
Article CAS Google Scholar
Blaber M, Zhang X-J, Matthews BW. Structural basis of amino acid helix propensity. Sci New York Then Washington. 1993; 260:1637.
Article CAS Google Scholar
Biou V, Gibrat J, Levin J, Robson B, Garnier J. Secondary structure prediction: combination of three different methods. Protein Eng. 1988; 2(3):185–91.
Article CAS PubMed Google Scholar
Maxfield FR, Scheraga HA. Status of empirical methods for the prediction of protein backbone topography. Biochemistry. 1976; 15(23):5138–53.
Article CAS PubMed Google Scholar
Tsai J, Taylor R, Chothia C, Gerstein M. The packing density in proteins: standard radii and volumes. J Mol Biol. 1999; 290(1):253–66.
Article CAS PubMed Google Scholar
Nakashima H, Nishikawa K. The amino acid composition is different between the cytoplasmic and extracellular sides in membrane proteins. FEBS Lett. 1992; 303(2-3):141–6.
Article CAS PubMed Google Scholar
Cedano J, Aloy P, Perez-Pons JA, Querol E. Relation between amino acid composition and cellular location of proteins. J Mol Biol. 1997; 266(3):594–600.
Article CAS PubMed Google Scholar
Lifson S, Sander C. Antiparallel and parallel β-strands differ in amino acid residue preferences. Nature. 1979; 282(5734):109–11.
Article CAS PubMed Google Scholar
Miyazawa S, Jernigan RL. Self-consistent estimation of inter-residue protein contact energies based on an equilibrium mixture approximation of residues. Proteins Struct Funct Bioinforma. 1999; 34(1):49–68.
Article CAS Google Scholar

Download references

Acknowledgments

We are grateful to Prof. Giuseppe Zanotti for his helpful insights.

Funding

This research has been partially supported by the University of Padova project CPDR150813/15 “Models and Algorithms for Protein–Protein Docking”.

Availability of data and materials

The datasets generated and/or analysed during the current study are available in the figshare repository https://doi.org/10.6084/m9.figshare.5354293. The complete list of selected features and the prediction results for each protein class are available in the supplemental files. The binaries (Linux x64) used to compute the training and testing samples and the Python scripts are available at the URL https://github.com/sebastiandaberdaku/PPIprediction.

Author information

Authors and Affiliations

Department of Information Engineering, University of Padova, via Gradenigo 6/A, Padova, 35131, Italy
Sebastian Daberdaku & Carlo Ferrari

Authors

Sebastian Daberdaku
View author publications
You can also search for this author in PubMed Google Scholar
Carlo Ferrari
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

SD and CF designed the study. SD prepared the data, developed the tools used to simulate and analyze the data and produced the results. SD and CF analyzed the results and wrote the manuscript. Both authors read and approved the final manuscript.

Corresponding author

Correspondence to Sebastian Daberdaku.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Additional files

Additional file 1

Contains additional information on some technical aspects of the research. (PDF 1360 kb)

Additional file 2

Contains the indices of the selected features for each protein class. (CSV 1.36 kb)

Additional file 3

Contains the prediction results on the test set for each protein, summarised in Table 8. (XLSX 67.9 kb)

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Reprints and permissions

About this article

Cite this article

Daberdaku, S., Ferrari, C. Exploring the potential of 3D Zernike descriptors and SVM for protein–protein interface prediction. BMC Bioinformatics 19, 35 (2018). https://doi.org/10.1186/s12859-018-2043-3

Download citation

Received: 29 August 2017
Accepted: 24 January 2018
Published: 06 February 2018
DOI: https://doi.org/10.1186/s12859-018-2043-3

Exploring the potential of 3D Zernike descriptors and SVM for protein–protein interface prediction

Abstract

Background

Results

Conclusions

Background

Methods

Protein surface representation

Interfacial regions of the protein surface

Residue feature set

3D Zernike descriptors

Patch representation using 3D Zernike descriptors

Support vector machine

Performance measures

Dataset

SVM model selection

Interface residue prediction

Results and discussion

Model selection results

Prediction results on the test set

Post-processing

Comparison with other methods

Conclusions

References

Acknowledgments

Funding

Availability of data and materials

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Ethics approval and consent to participate

Consent for publication

Competing interests

Publisher’s Note

Additional files

Additional file 1

Additional file 2

Additional file 3

Rights and permissions

About this article

Cite this article

Share this article

Keywords

BMC Bioinformatics

Contact us