Hierarchical representation for PPI sites prediction

Background Protein–protein interactions have pivotal roles in life processes, and aberrant interactions are associated with various disorders. Interaction site identification is key for understanding disease mechanisms and design new drugs. Effective and efficient computational methods for the PPI prediction are of great value due to the overall cost of experimental methods. Promising results have been obtained using machine learning methods and deep learning techniques, but their effectiveness depends on protein representation and feature selection. Results We define a new abstraction of the protein structure, called hierarchical representations, considering and quantifying spatial and sequential neighboring among amino acids. We also investigate the effect of molecular abstractions using the Graph Convolutional Networks technique to classify amino acids as interface and no-interface ones. Our study takes into account three abstractions, hierarchical representations, contact map, and the residue sequence, and considers the eight functional classes of proteins extracted from the Protein–Protein Docking Benchmark 5.0. The performance of our method, evaluated using standard metrics, is compared to the ones obtained with some state-of-the-art protein interface predictors. The analysis of the performance values shows that our method outperforms the considered competitors when the considered molecules are structurally similar. Conclusions The hierarchical representation can capture the structural properties that promote the interactions and can be used to represent proteins with unknown structures by codifying only their sequential neighboring. Analyzing the results, we conclude that classes should be arranged according to their architectures rather than functions. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-022-04624-y.

carried out by interacting with other molecules, including other proteins, RNAs or DNAs, and small ligands [1]. The interactions between two proteins, known as protein-protein interactions (PPIs), determine the metabolic and signaling pathways [2], whose dysfunction or malfunction, as well as alterations in protein interactions, cause several diseases, with the most notable ones being neurodegenerative disorders [3] and cancer [4].
The fast, correct, and reliable identification of PPI sites facilitates understanding the role a protein has in the different biological functions and helps the understanding of the molecular mechanisms of diseases with direct applications in the discovery of new drugs [5][6][7]. Since experimental methods, including NMR and X-ray crystallography, are labor-intensive, time-consuming, and have high costs, computational methods to predict the PPI sites play a fundamental role. These methods can be roughly divided into sequence-based, structure-based, and hybrid. The sequence-based ones employ information derived from the amino acid sequence alone and use various physico-chemical properties of residues to identify the interface regions. Examples of these features are interface propensity, hydrophobicity, and electrostatic desolvation. However, structural attributes, such as secondary structure and solvent accessibility, are predicted from sequences. A detailed description of the most significant sequence-based methods is reported in [8]. On the other hand, structural-based approaches extract information from the protein shape. The features include solvent-accessible surface area, secondary structure, crystallographic B-factors, and local geometries. Finally, the hybrid methods combine both sequence and structure-derived information.
Several types of computational approaches have been proposed for the PPIs prediction. Among the sequence-based methods, the representative ones include PPiPP [9], PSIVER [10], DLPred [11], NPS-HomPP [12], and LSTM-PHV [13]. PPiPP predicts PPIs by using the position-specific scoring matrix (PSSM) and amino acid composition; PSIVER takes advantage of the PSSM and predicted accessibility as input for a Naive Bayes classifier. DLPred uses a long-short term memory (LSTM) neural network to learn features such as PSSM, physical properties, and hydropathy index. NPS-HomPPI infers interfacial residues from the ones of homologous interacting proteins. LSTM-PHV uses the long short-term memory model with the word2vec embedding and represents the amino acid sequence context as words.
However, it is apparent that more information is required to achieve higher accuracy in predicting PPIs: structural features are important discriminative attributes. In [14], You et al. proposed an approach that transforms the PPI network into a low-dimensional metric space and predicts the PPI sites based on the similarity between the points in the embedded space. In [15], Guo et al. defined a method based on autocovariance coding and support vector machine algorithm. Zhang et al. proposed PredUs, an interactive web server for interfaces prediction based on structural neighbors and a measure of structural similarity of protein structure [16]. With the same aim, Kufareva et al. developed a method based on local statistical properties of the protein surface derived at the level of atomic groups [17]. PrISE uses only the interface structure for template identification, which increases its prediction coverage [18]. On the other hand, some methods based on protein structures take advantage of sequence features. Daberdaku and Ferrari proposed a method based on molecular surface representations that use Zernike descriptors enriched with a chosen subset of physico-chemical properties [19,20]. Finally, SPPIDER [21] uses the relative solvent accessibility to sequence together with structural features.
Most of the described approaches employ classic machine learning algorithms, including support vector machines, neural networks, and k-Nearest Neighbor. Recent developments of neural networks include deep learning techniques, which have been successfully applied for PPI prediction. Representative sequence-based methods take advantage of Recursive Neural Network architecture [22], a stacked autoencoder [23], and Multimodal Deep Polynomial Network [24]. Structure-based methods are usually based on graphs, such as Convolutional Neural Networks [25,26] and Graph Convolutional Networks [27,28]. Although structure-based and hybrid methods are generally more accurate than sequence-based ones, their applicability is limited because they require knowledge of protein structures. Instead, most of the proteins, especially those involved in transient binding interactions or the engineering phase, do not have experimentally determined 3D configurations. In the PPI prediction, one of the biggest challenges for the graph-based deep learning methods is to abstract the proteins to capture the conformational aspects that change when proteins interact with their binding partners [29]. Structural features extracted from unbound proteins may not exist in bound complexes due to conformational changes induced by or required in the binding. In the literature, the performances of methods trained on the bound versions of proteins usually are better than ones obtained by considering unbound proteins [20].
This work mainly focuses on the challenge of proteins representations in the case of partner-independent predictions of interfaces, i.e., the prediction is carried out on the single proteins without any knowledge of the potential binding partner. We consider known experimentally-determined three-dimensional structures, in both unbound and bound versions. To face the protein representations challenge, we abstract the protein shapes into a low-dimensional metric space, referred to as hierarchical representation. This representation formalizes the hierarchical nature of proteins. The protein consists of a amino acids sequence (also called primary structure) that folds back into local and functional patterns (or secondary structure). Such local motifs arrange themselves into global configurations (domains and tertiary structures) to enable their biological tasks. This description shows an intrinsic hierarchy because the secondary structure contains all amino acids precedence, i.e., the primary structure, equipped with local spatial knowledge. Moreover, the tertiary structure formalizes all information related to the sequence of amino acids and the local spatiality. To formalize such a hierarchy that characterizes the protein shapes, we introduce the sequential and spatial neighborhood relationships among amino acids.
Two residues are sequential neighbors if they are consecutive in the sequence, or spatial neighbors if their two alpha carbons ( C α ) are located at distance smaller than a given threshold.
To predict the interaction sides, we designed a hybrid method, called HSS-PPI. It exploits the Graph Convolutional Networks, a deep learning framework [30], as a computational approach. We used eight physico-chemical features to represent the molecular physico-chemical aspects. We also consider the more common structural features according to the biological hypothesis related to the link between shape and function. The physico-chemical indexes are selected from the AAindex1 dataset, a database of numerical value [31]. By using a consensus fuzzy clustering method on all available indices in the AAindex1, Saha et al. identified three high quality subsets (HQIs) of all available in-dices, namely HQI8, HQI24 and HQI40 [32]. The features of the HQI8 amino acid index set were employed in this work. In [20], these features were shown to adequately discriminate interface patches from non-interface ones in bound and unbound Ab structures.
The structural features are the solvent-accessible surface area, relative solvent accessible surface area, Torsion angles PHI and PSI, and the number of residue contacts. Thanks to this molecular abstraction, the proposed approach can be trained on molecules with known structures, as well as on ones with unknown or partially known three-dimensional structures. This way, distinctively from the other methods in the literature [33], the proposed computation approach enables us to use the structural knowledge from other molecules to predict the PPIs of molecules with unknown spatial configurations.
Our study, based on the idea introduced in [34], takes into account the eight classes of proteins extracted from the Protein-Protein Docking Benchmark 5.0 [35], and considers four reasonable values (6Å, 8Å, 10Å, and 12Å) for the distance threshold to abstract the proteins. To investigate the effect of different representations, we applied the framework using hierarchical protein representations, contact mapping, and, finally, the residue sequence. The prediction results were evaluated using six metrics, namely: the area under the receiver operating characteristic curve (AU-ROC), the accuracy, the precision, the recall, the F-measure, and the Matthews correlation coefficient. Finally, such results are compared to the ones obtained with some state-of-the-art protein interface predictors (SPPIDER, PrISE, and NPS-HomPPI).

Materials and methods
The prediction of PPI sites is a classification procedure of nodes in a graph: each node represents a single amino acid of a protein. The aim is to assign a label, either 1 (interface) or 0 (no-interface), to each node. To address the problem, we take advantage of GCNs. We consider dimers from the Protein-Protein Docking Benchmark 5.0 (DB5) to make a reasonable comparison with the existing PPIs prediction results.

Benchmark dataset
The data used for the analysis are the Protein-Protein Docking Benchmark 5.0 (DB5). It consists of 230 complexes for bound and unbound versions. Each complex is made up of at least 30 amino acids, characterized by a resolution greater than 3.25 Å. To build our datasets, we divided these complexes according to the eight functional classes proposed by the DB5. The classes are (1) Antibody-Antigen (A), (2) Antigen-Bound Antibody (AB), (3) Enzyme-Inhibitor (EI), (4) Enzyme Substrate (ES), (5) Enzyme complex with a regulatory or accessory chain, (ER) (6) Others, G-protein containing (OG), (7) Others, Receptor containing (OR), (8) and Others, miscellaneous (OX). For each class, we separated the bound version from the unbound one. Moreover, we split the receptors from ligands. In this way, for each functional group, we consider four datasets: unbound ligands, bound ligands, unbound receptors, and bound receptors. We prepared these datasets to understand the most suitable representation for different protein complex classes. To make a reasonable comparison with the existing PPIs prediction approaches, we split the data into training and test sets following [20]. Furthermore, we randomly split the training set into two parts (training set and validation set) as shown in the Additional file 1. In these datasets, we consider amino acids as interface residues if they had at least one heavy (non-hydrogen) atom within 5Å from any heavy ones of the other protein (the same threshold used in [20]).

Hierarchical representations
Protein representation is one of the challenges in graph-based deep learning method applications. The challenge consists of an abstraction of the protein shape, usually formalized in terms of atomic coordinates and represented as PDBx/mmCIF, PDB, or XML files. In the literature, a common abstraction codifies the distance between every pair of residues determined using Euclidean distance using a binary matrix, the so-called contact map [36]. Let i and j be two residues of a given protein, the respective element m i,j of the matrix is equal to 1 if the two residues are closer than a predetermined threshold, and 0 otherwise. Different contact definitions have been proposed, including the distance between the alpha carbon ( C α ) atoms with threshold 6-12 Å, the distance between the beta carbon ( C β ) atoms with cut-offs ranging from 6 to 12Å ( C α is used for Glycine), and the distance between the side-chain centers of mass.
Considering the hierarchy of protein shapes, we introduce two relationships, spatial and sequential neighboring, among amino acid pairs. Each residue pair is sequential neighboring if the two amino acids are consecutive in the sequence; otherwise, the two amino acids are spatial neighboring if their Euclidean distance is less than a fixed threshold. The two relationships allow us to distinguish chemical bonds of the primary sequence from the other ones established during the folding process and depended on the amino acid distance. Taking into account this observation, we define the hierarchical representation by quantifying the spatial and sequential neighboring relations starting from the protein PDB file. Let i and j be two residues of a given protein, we assign 1 if they are sequential residues in the chain, 1/(1 + x) if x if the Euclidean distance between the respective C α atoms of the resides is less than a predetermined threshold, or 0 otherwise. The definition can be summarized as follows As a consequence, each value a i,j is between 0 and 1. By exploiting the order of amino acids imposed by the primary structure, we can uniquely arrange such values into a matrix: the adjacency matrix of a weighted undirected graph with the nodes being the residues, and the weighted edges representing the relationships among them. This approach allows us to formalize proteins whose 3D structure is known together with molecules with unknown spatial configuration by codifying only the sequential neighboring. Such an aspect is crucial, for example, when a protein is engineered, and the entire structure is still unknown. (1) if i and j are sequential neighboring residues, 1 1+x if i and j are spatial neighboring residues, 0 otherwise.

Input features
The features play a fundamental role in deep learning-based classification procedures. As mentioned in the Background, PPI hybrid methods use structural and physicochemical features. We extract biochemical properties from AAindex, a database of physico-chemical and biochemical indices of amino acids and amino acid pairs published in the literature [31]. Recently, these indices have been used in different bioinformatics tasks related to proteins, including linear B-cell epitome identification [37] and proinflammatory peptide [38]. The AAindex1 section of this database consists of 566 indices. Three subsets of AAindex1 indices, namely HQI8, HQI24, and HQI40, are generated by considering the centers of 8, 24, and 40 clusters, respectively, computed using a consensus procedure over a fuzzy clustering method as in Saha et al. [32]. In this work, we consider the HQI8, reported in Table 1.
Together with the such physico-chemical and biochemical information, we extract structural features for each protein. In particular, we consider -Accessible surface area (ASA), the protein surface area accessible to the surrounding solvent; -Relative accessible surface area (rASA), a degree of residue solvent exposure [39]; -Contact number, the number of spatial neighboring of residues; -Torsion angles and : the description of the rotations of the polypeptide backbone around the bonds between N-C α and C α -C, respectively; -Amino acid types, the amino acid's identity or type.
These numerical indices form the feature vectors of our model based on the graph convolutional networks architecture. The first eight components of the vectors are the physico-chemical and biochemical indices reported in Table 1, followed by four real numbers corresponding to structural features (ASA, rASA, Contact number, and Torsion angles). Finally, the last twenty values are the components of the one-hot encoding vectors representing the types of amino acids in the protein. For these reasons, the first eight and the last twenty values of the vectors are the same for a particular type of amino acid. Instead, the remaining elements are related to the structure and are different for each residue in the sequence

Graph convolutional networks
Graph Convolutional Network is a neural network architecture proposed by Kipf and Welling [30]. The architecture works on graphs and takes advantage of their structural information aggregating them on each node from its neighborhoods in a convolutional fashion. Let G(V , E, w) be an undirected weighted graph, where V = {v 1 , v 2 , . . . v n } is the set of n nodes, E ⊂ V × V is the set of m edges, and w : E −→ [0, 1] is a weight function that for each pair of E associates a number (weight) of the interval [0, 1]. Let A be the symmetric matrix, the so-called adjacency matrix, uniquely associated with the graph G , whose element a i,j ∈ [0, 1] . For each graph G , we associate a matrix X ∈ R n×m , whose m rows represent the feature values to associate with the corresponding nodes. Finally, let L ∈ {0, 1} n be the vector of labels. The GCN model aims to learn a function to predict the labels on each node. The model takes the adjacency matrix A of graph G and the input feature matrix X as input. Each layer of the architecture is defined in terms of the following propagation rule (a non-linear function) where Z (0) = X is the input feature matrix. Each layer Z (h) corresponds to a feature matrix, whose rows correspond to the features representing the corresponding nodes. Each layer aggregates these features to form the next layer's features using the propagation rule f. The propagation rule used in this framework is where I is the identity matrix, and D is the node degree of A + I . D is a diagonal matrix, whose elements d i,i equal the number of incident edges of node v i incremented by one. W (h) is the weight matrix for layer h, and σ is a non-linear activation function. In this work, the Rectified Linear Unit (ReLU) function is appliyed. The feature aggregation for each node is calculated using the following vector equation where j iterates over the neighboring nodes of v i , and c ij is a normalization constant obtained from the adjacency matrix to account for the degree difference between v i and v j .

HSS-PPI, a hybrid method based on protein shape and sequence for PPI site prediction
We designed HSS-PPI, a hybrid method based on protein shape and sequence for PPI site prediction. Our overall framework consists of three steps. The first one is to abstract the protein formalized in terms of atomic coordinates of the PDB file. The proposed graph-based abstractions are the hierarchical structure, the contact map, and the sequence: each one of them can be represented by an adjacency matrix. As Fig. 1, we associate the adjacency matrix of the selected abstraction to each protein.
Such matrices are the blocks of the sparse block-diagonal adjacency matrix, represented on the right of Fig. 1, which is the input of our model, as shown on the left of Fig. 2. The composite adjacency and feature matrices are split into training, validation, and testing sets using the corresponding row indices as boundaries. The second step consists in adding structural features to the representation. Each protein feature is formalized as a vector. The feature vectors are concatenated to obtain the respective feature matrices, as shown in the center of Fig. 2. In the third and final step, we use the graph convolutional network technique to predict the site of PPIs, as represented on the left of Fig. 2. The model classifies the labels of the nodes in the validation and test sets in a semi-supervised fashion, since only the labels of the elements belonging to the training set are provided as an input to the GCN model. Input Labels L GCN(A, X, L) Model Output Labels Y

Fig. 2 PPI interface residue classification with a semi-supervised GCN framework
We compare the performance of the approach with other state-of-the art methods, one proposed by Daberdaku and Ferrari [20], NPS-HomPPI [12], PrISE [18], and SPI-DER [21,40]. As detailed discussed in the Results section, the performance values of these competitors are taken from [20].
The method proposed by Daberdaku and Ferrari [20] takes into account the molecular surface representations for describing protein structure. It considers local surface descriptors based on 3D Zernike moments to identify potential binding sites. These descriptors, which are invariant to roto-translations, are extracted from the protein surface and are enriched with eight physico-chemical properties. Furthermore, it uses Support Vector Machines as a classifier to distinguish interacting local surfaces from non-interacting ones. NPS-HomPPI [12] is a homology-based method that can be used to predict interface residues without any knowledge of the interaction partner. It is based on similarity criteria required for accurate homology-based inference of interface residues in query protein sequence homologs based on the assumption that homologs share significant similarities in sequence, structure, and functional sites. Moreover, NPS-HomPPI classifies the templates into either Safe, Twilight, or Dark Zone, and uses multiple templates from the best available zone to infer interfaces for query proteins. PrISE [18] is a local structural surface similarity-based computational method for predicting the PPI sites. This method represents each local surface structure by using "structural elements". Each of them consists of a central residue and its surrounding surface neighbors that are represented by their atomic composition and accessible surface areas. The approach decomposes molecular surfaces into many structural elements and searches these elements into pre-calculated databases for similar structural elements with experimentally determined interface information. Finally, it weighs them according to their similarity with the structural elements of the query protein. SPPIDER [21,40] identifies and recognizes the interface residues site by integrating enhanced relative solvent accessibility (RSA) predictions with high resolution structural data. The approach is based on the concept of "fingerprint" that is derived from the difference between the predicted and actual relative accessible surface area (rASA) of residues as features for interface prediction. Furthermore, SPPIDER uses a consensus method that combines the output of 10 Neural Networks with the majority voting to merge the most informative features into the final predictor.

Implementation
The framework is implemented in Python using Biopython and TensorFlow 2.0 [41,42]. We use the PDBParser and DSSP modules of BioPython to abstract the shape into a graph and extract the structural features, respectively. The URL https:// gitlab. com/ sebas tiand aberd aku/ hss-ppi points to the dataset and source code that form our framework. The framework was tested on an HPC Server with eight 12-Core Intel Xeon Gold 5118 CPUs running at 2.30 GHz and using 1.5 TB RAM. We set 32 parallel threads under OS Fedora Linux 25. We used Stochastic Gradient Descent as an optimization algorithm, with a learning rate and a dropout value equal to 0.001 and 0.5, respectively. Moreover, we set two hidden layers with 35 and 32 features. Empirical observation during the experimental face helped us in setting 1500 as the maximum number of reached epochs: the training can stop earlier if the performance on the validation set stopped improving.
Moreover, we train our deep learning framework on three protein abstractions, the contact map, the residue sequence, and the hierarchical representation, considering distance thresholds of 6 Å, 8 Å, 10 Å, and 12 Å.

Performance evaluation
Predicting interfacial residues can be formulated as a binary classification problem where each protein residue can be either interfacial or non-interfacial. We evaluate the performance of our approach and compare it with one of some other methods in the literature using six evaluation metrics: Accuracy (Acc), Precision (P), Recall (R), F-measure ( F 1 ), the area under the receiver operator characteristic curve (AUC), Matthews correlation coefficient (MCC) where TP represents the number of interaction sites identified correctly (true positive), FN denotes the number of interaction sites identified incorrectly (false negative), FP represents the number of non-interaction sites identified incorrectly (false positive), TN denotes the number of non-interaction sites identified correctly (true negative). As mentioned earlier, the prediction interface is an imbalanced learning problem. Therefore, F-measure, MCC, and AUC are the three most important evaluation metrics as they can provide morecomprehensive measures than other evaluation metrics [43].

Comparison with other methods and discussion
The performance results, evaluated using six metrics (F1 score, Accuracy, Precision, Recall, MCC, AU-ROC) for our method, HSS-PPI, are presented in Tables 3 and 4. We compare these results with the performance obtained by competitors, one proposed by Daberdaku and Ferrari [20], NPS-HomPPI [12], PrISE [18], and SPIDER [21,40]. Since our organization of the molecules in testing, training and validation sets is coherent to that proposed in the literature, we can take the performance values of the competitors directly from [20], where the performance of the proposed approaches is quantified only in terms of AUC-ROC. This aspect is not a limitation to comparing the methods since the prediction of PPI interface sites is a highly imbalanced classification problem. The AUC-ROC is a comprehensive comparison metric independent of any decision boundaries. Moreover, it is robust to class imbalance. Dataset sizes are shown in Table 2.
HSS-PPI performs better than the competitor predictors in the bound and unbound versions of classes A l , AB l , and OG l . It is an expected result since the ligands are small molecules that adapt their shape to interact with the receptor partner, even though  Table 3 Measures of F1 score, classification accuracy, precision, recall, MCC and ROC-AUC obtained on the test set of the ligands classes      they preserve their native architecture determined by the folding process. Therefore, methods based on the internal conformation of three-dimensional structures are more suitable than the approach based on shapes or homology like PrISE and NPSHomPPI, respectively. However, in the bound version of class A l , our method achieves a ROC-AUC of 65.1% , while, for the competitors, the maximum ROC-AUC is 63.0% obtained by SPPIDER. Similarly, for class A l , our method achieves a ROC-AUC of 71.3% , while, for the competitors, the maximum ROC-AUC is 62.6% for NPS-HomPPI. The ROC-AUC values obtained with our method take into account the contact map abstraction of proteins and considers the threshold equals 6Å and 8Å for bound and unbound versions, respectively. However, our method achieves higher ROC-AUC values than those obtained with competitive predictors independently of the representation chosen and the threshold value. The AUC-ROC values obtained with the different abstractions and thresholds are comparable, except for the contact map representation with the cut-off equals to 12Å. Also, the values of other metrics (F1 and MCC) are comparable for the bound and unbound versions of this class. However, the results obtained with the hierarchical representation are more reliable than ones obtained for contact maps. Noticeably better prediction performance is achieved in the unbound and the bound versions of class AB l . In fact, for the bound version, the ROC-AUC of our method is equal to 82.6% , and the maximum ROC-AUC among considered competitors is 68.3% (PrISE). For the unbound version, the ROC-AUC obtained with our predictor is equal to 78.7% , and the maximum ROC-AUC value achieved with the considered competitors is 71.3% (NPS-HomPPI). The ROC-AUCs obtained with our methods take advantage of contact map representation to abstract the molecules, and the threshold equals 12Å. However, as for the A l class, our method outperforms the competitors regardless of representation and threshold value. In the bound and unbound versions of protein class OG l , our method achieves a ROC-AUC of 77.4% and 69.1% , respectively.  F1 and MCC values obtained with our method using hierarchical representation outperform the ones obtained by considering the contact maps regardless of the chosen threshold. Our prediction method underperformed compared to the competitors in the bound and unbound versions of the classes ER l , ES l , and OX l . The molecules of these classes show several different architectures, as reported in Tables 4 and 5 of the Additional file 1. Therefore, the internal conformations of threedimensional structures differ from each other.

HSS-PPI
The homology-based methods are better performing than the approaches based on shapes or internal conformation. As shown in Table 3, the structural information of the molecules does not improve the performance of the methods. Moreover, we observe that the values of AUC obtained using the sequence are comparable to ones obtained by representing the proteins as contact maps and hierarchical. This aspect could depend on the datasets, formed by molecules with the same biological function rather than structures.
The proposed methodology performs better than the competitor predictors for bound and unbound classes A r , AB r , EI r , and OG r . Like to the A l , AB l , and OG l classes, this result was expected since the molecules of A r , AB r , classes shown the same architecture, while the architectures of the test set for EI r , and OG r classes are well represented in the training and validation sets. As a consequence, our method is more appropriate than the competitors. In the bound and unbound version of class A r , our method achieves a ROC-AUC of 97.6% and 98.0% , respectively. The maximum achieved ROC-AUC values from the competitors are 95.4% for the unbound and 93.9% for the bound version. Both values are obtained by the method proposed by Daberdaku and Ferrari. The ROC-AUC values obtained with HSS-PPI (our method) take into account the contact map abstraction of proteins with the threshold equals 8Å. Our method achieves ROC-AUCs that are better than ones obtained with the competitor predictors regardless of the choice of representation and threshold value. On the other hand, the value of F1, MCC, and AUC-ROC obtained with hierarchical representation mainly outperform the ones related to the abstractions based on the contact map. In the bound version of class AB r , our method achieves a ROC-AUC of 95.8% , while, for the competitors, the maximum ROC-AUC is 89.0% obtained by the method proposed by Daberdaku and Ferrari. Similarly, in the unbound version of class AB r , our method achieves a ROC-AUC of 96.5% , while, for the competitors, the maximum ROC-AUC is 84.5% for Daberdaku and Ferrari's method. The ROC-AUC values obtained with our method take into account the contact map abstraction of proteins and consider the threshold equals 10Å. The values of other metrics (F1 and MCC) related to hierarchical representation are mainly better than ones obtained for contact maps. In the bound and unbound version of class EI r , our method achieves a ROC-AUC of 77.0% and 74.5% , respectively. The maximum ROC-AUC values of competitors are 76.4% and 74.7% . These values are obtained by the methods proposed by Daberdaku and Ferrari. The ROC-AUC values related to our method take into account hierarchical representation as protein abstraction and considers the threshold equals 10Å, while the F1 and MCC values are comparable. In the bound version of class OG r , our method achieves a ROC-AUC of 75.6% , while, for the competitors, the maximum ROC-AUC is 74.8% obtained by the method proposed by SPPIDER. Our method achieves a ROC-AUC of 74.8% for the unbound version, which is comparable with the ROC-AUC of 76.8% obtained by NPS-HomPPI. However, for both bound and unbound versions, the F1 and MCC values obtained by using the hierarchical representation outperform the value related to the contact map. The results of our prediction method are comparable with the competitors for proteins of class ER r in the bound and unbound versions, while our prediction method underperformed compared to the competitors in the bound and unbound versions of the classes ES r , OG r , and OX r . Observing Table 4, we note that the values of AUC obtained using the sequence are comparable or better than ones obtained by representing the proteins as contact maps and hierarchical. Hence, as observed for ligands, we could suppose that also the elements in ES r , OG r , and OX r exhibit more sequential similarity than the structural one.
By analyzing the results, it emerges that the threshold value used to construct contact maps or hierarchical representations dramatically affects the performances of our method in varying ways for different datasets. We note different effects when the threshold value is increased. In some datasets, increasing the threshold value determines an increment of performance, while, in others, this leads to a decrease. Since the cut-off is a purely geometric value, the abstractions can mainly capture the local structural motifs (i.e., elements of secondary structures) if the cut-off is low (i.e., 6-8 Å). On the other hand, a higher cut-off value also captures the geometrical relations of the global structure (i.e., tertiary structure). This additional information starts playing a fundamental role. Our experiments bring out a link between the cut-off and the molecular architectures.
We note that in the dataset whose molecules exhibit the same architecture (for example the class of AB receptors which consists of paired heavy-light chains (H-L) with the same structure [44]), the performance values of the approach increase with an increasing threshold. The performance decreases with an increasing cut-off, for example in the EI receptor dataset (which consists of enzymes), where the molecules of the dataset exhibit many different architectures. The cause is intrinsic to the hierarchical nature of the shape, which is strictly related to the protein's biological tasks. The elements of secondary structures play a critical role in several functions like PPI interactions [45]. These elements, and, thus, the relative contact maps and hierarchical representations, are similar regardless of their arrangements in their global 3D configuration. Instead, such structures, i.e., tertiary structures, are comparable only among molecules with the same architecture (such AB r dataset) [46]. Therefore, the threshold increment in the dataset with molecules characterized by heterogeneous architectures results in adding noise, while such increment represents further details in the other case. Figures 3 and 4 show the Receiver Operating Characteristic curve and Precision Recall curve obtained by the GCN model on each dataset of ligands and receptors, respectively. We have trained the model by using the contact map and hierarchical abstraction with 500 epochs. For each abstraction, we have considered four different thresholds (6Å, 8Å, 10Å, and 12Å).
To interpret the results, we consider the architectures of the ligands and receptors according to the CATH classification. In particular, the architectures of the molecules are Sandwich, 3-Layer(aba) Sandwich, 2 Solenoid, Orthogonal Bundle, Fig. 3 Average Receiver Operating Characteristic curve comparison of the proposed PPI interface prediction method by using hierarchical representation, contact map and sequence as protein representation with different thresholds (6Å, 8Å, 10Å, 12Å) for each protein class into account such ratios and the performance of our method, we observe a relation between them. For example, we observe that our method achieves optimal results for classes A r and AB r , whose molecules mainly show a single architecture, the Sandwich one. Furthermore, our approach does not reach sufficient values for some classes like  Table 4 Measures of F1 score, classification accuracy, precision, recall, MCC and ROC-AUC obtained on the test set of the receptors classes       the EI r , whose molecules show eight different architectures. In particular, the Alpha-Beta Complex and Propeller are two architectures that characterize some molecules in the test set, but they are not present in the training and validation set. Therefore, site prediction of some molecules like 1QQU and 1RGH, classified as easy to predict in the Benchmark, shows low performance. The observation also works for the ligand classes.
To confirm this observation, which may seem like an intuition, we conducted further experimentation. Looking at the Benchmark, we noted the 2-Layer Sandwich architecture spans over all classes of ligands, i.e., 2-Layer Sandwich is the most represented architecture in the Benchmark. Thus, we consider all proteins that exhibit 2-Layer Sandwich architecture despite their biological group. The list of the molecule group is reported in Table 3 in the Additional file 1, while the performance results, evaluated using the six metrics, are presented in Table 5. We observe that the best value of AUC-ROC is 85% , obtained with a threshold equal to 12Å. We also note that this result is the best one among all ligand biological classes. Moreover, it is evident that the threshold value used to construct contact maps or hierarchical representations dramatically changes the performance of our method. We note that in the dataset whose molecules show the same architecture, the performance values of the approach increase with an increasing threshold. Moreover, we observe that the performance decreases with increasing cut-off if molecules of the dataset show many different architectures. These observations lead us to hypothesize that the structural organization of amino acid interactions changes depending on the type of architecture. The increment of the threshold in the dataset with molecules characterized by several architectures may mean adding noises, while such increment represents further information in the other case. These results and observations confirm the biological hypothesis that protein behaviors depend on their three-dimensional conformations. The analysis reveals that organizing the dataset by considering the structural similarity improves the performance of our method. Thus, structural classifications of proteins play a fundamental role, and they can be faced by computation methods without any requirement of biological experiments, which are expensive and time-consuming.
These results and observations agree with the biological hypothesis that protein behaviors depend on their three-dimensional conformations. Analyzing the results, it is evident that the performance of our approach can be improved by a structural classification to select the molecules of training and validation sets.
Such structural classification of proteins can be faced by computational methods without requiring any biological experiments which can be expensive and time-consuming.
Another suggestion is related to the features that can be selected from the AAindex databases taking into account the molecular class to capture the physical-chemical characteristics exhibited by the molecules in the diverse interaction mechanisms. Furthermore, the molecular abstraction that we have introduced in terms of spatial and sequential relationships between amino acid is another important feature since it allows us to formalize both proteins whose 3D structure is known both unknown. Such an aspect is crucial, for example, when a protein is engineered, and the entire structure is still unknown since the method can be trained by taking advantages of molecules with known structures. Quadrini et al. BMC Bioinformatics (2022) 23:96 Table 5 Measures of F1 score, classification accuracy, precision, recall, MCC and ROC-AUC obtained on the test set of the ligands classes

Conclusions and future work
In this work, we have focused on the PPI sites prediction by considering their experimentally-determined structures. We have classified the amino acids into interface and no-interface using the Graph Convolutional Networks technique, a deep learning framework. To test the approach, we have applied the framework to the dimers of DB5, divided into eight functional classes. Moreover, we have considered three representations (hierarchical representations, contact maps, and residue sequences). We have also considered different thresholds for the distances between C α atoms (6Å, 8Å, 10Å, and 12Å). In the literature, other structure-based methods have been proposed and tested [33]. However, the proposed molecular abstraction, obtained by quantifying spatial and sequential relationships among amino acids and referred to as hierarchical structure, is another relevant feature. Thanks to the representation, HSS-PPI trains on molecules with known structures together with ones with unknown or partially known three-dimensional structures. As a result, differentiating from the methods presented in the literature, our approach allows us to consider the structural knowledge of other molecules to predict PPIs of molecules with unknown or partially known spatial configurations. Such an aspect is crucial, for example, when a protein is engineered, and the entire structure is still undetermined. Consequently, we can conclude that the performance depends on the molecules' structural similarity Therefore, our approach works better on proteins with similar structures rather than similar functions. As future work, motivated by the results' analysis, we have planned to apply our framework on paratope interacting residue prediction. Moreover, we have also decided to use our framework considering another classification of the DB5 according to protein threedimensional structures. In this scenario, motivated by our previous results obtained from RNA secondary structures with pseudoknots comparison [47,48], we believe that it is important to compare and classify the protein structures considering their tree representations and exploiting edit distance or alignment algorithms. Although some structural classifications have been proposed in the literature, like SCOP [49] or CATH classification [50], our approach, which is an extension of the one proposed to compare RNAs, will work on polynomial time and will neglect the sequence of amino acids.
Motivated by the different performances obtained considering the proposed datasets (corresponding to functional classes of DB5) with a set of fixed features, another important direction is to investigate different feature set by using feature selection methods. In this way, we can consider a set of input variables and choose the more representative quantities for each functional protein group. It is also interesting to select different features for receptors and ligands separately. We will focus our attention on the structural properties by exploring the RNA-based topological methodology introduced in [51]. Finally, another important future direction to explore is to extend the proposed approach to identify the binding partner specificity.