Exploiting residue-level and profile-level interface propensities for usage in binding sites prediction of proteins

Background Recognition of binding sites in proteins is a direct computational approach to the characterization of proteins in terms of biological and biochemical function. Residue preferences have been widely used in many studies but the results are often not satisfactory. Although different amino acid compositions among the interaction sites of different complexes have been observed, such differences have not been integrated into the prediction process. Furthermore, the evolution information has not been exploited to achieve a more powerful propensity. Result In this study, the residue interface propensities of four kinds of complexes (homo-permanent complexes, homo-transient complexes, hetero-permanent complexes and hetero-transient complexes) are investigated. These propensities, combined with sequence profiles and accessible surface areas, are inputted to the support vector machine for the prediction of protein binding sites. Such propensities are further improved by taking evolutional information into consideration, which results in a class of novel propensities at the profile level, i.e. the binary profiles interface propensities. Experiment is performed on the 1139 non-redundant protein chains. Although different residue interface propensities among different complexes are observed, the improvement of the classifier with residue interface propensities can be negligible in comparison with that without propensities. The binary profile interface propensities can significantly improve the performance of binding sites prediction by about ten percent in term of both precision and recall. Conclusion Although there are minor differences among the four kinds of complexes, the residue interface propensities cannot provide efficient discrimination for the complicated interfaces of proteins. The binary profile interface propensities can significantly improve the performance of binding sites prediction of protein, which indicates that the propensities at the profile level are more accurate than those at the residue level.


Background
Protein function is very often encoded in a small number of residues located in the functional active site, which are dispersed around the primary sequence, but packed in a compact spatial region [1]. Recognition of functional sites in proteins is a direct computational approach to the characterization of proteins in terms of biological and biochemical function. Localization of functional sites will allow us to understand how the protein recognizes other molecules, to gain clues about its likely function at the level of the cell and the organism, and to identify important binding sites that may serve as useful targets for pharmaceutical design [2].
Recently, a series of computational efforts to identify interaction sites or interfaces in proteins have been undertaken. A number of studies on the characteristics of protein interfaces have provided clues for binding site prediction. Several methods have been proposed to predict these sites based on the sequence or structure characteristics of known protein-protein interaction sites.
In terms of physical chemistry, protein interfaces are generally observed to be more hydrophobic than the remainder of the protein surface [3,4]. Moreover, the interfaces of permanent complexes tend to be more hydrophobic when compared to those of transient complexes [5]. Some interfaces have a significant number of polar residues [6], usually where interactions are less permanent [7]. Charged side-chains are often excluded from protein-protein interfaces with the exception of arginine [8], which is one of the most abundant interface residues regardless of interaction types [9]. The evolutionary conservation of residues is another property that may be utilized to predict protein-protein interfaces [10]. The evolutionary trace (ET) method tries to identify functional sites by using the sequence variations and functional divergences found in nature [11,12]. Accurate ET analysis requires functionally relevant sequence and high-quality alignments as input [13]. A structure-independent criterion has been presented to measure the quality of evolutionary trace [14]. Because sequence conservation reflects not only evolutionary selection at binding sites to maintain protein function, but also the selection throughout the protein to maintain the stability of the folded state [15], many researchers try to distinguish functional and structural constraints on protein evolution [16,17]. A comprehensive evaluation of different conservation scores has been performed by Valdar [18]. Other sequence information has also been exploited such as the phylogenetic profile [19,20], the sequence motifs [21], sequence profile [22,23], evolution rate [24,25], etc.
The features extracted from the three-dimensional structures of protein complexes are critical for a full understanding of the mechanism of interactions because they provide specific interaction details at the atomic level. The accessible surface area (ASA) is one of the most widely used features [26]. Molecular docking seems to be the most principled computational approach for identifying the interaction sites [27], but it requires the precise design of energy function [28], either physical energy [29] or empirical scoring functions [27,30]. 3D-motifs have also been successfully used to identify binding sites of the same type in proteins with different folds [31][32][33][34]. Patch analysis using a six-parameter scoring function can distinguish the interface from other surfaces [3].
Because none of the above-mentioned properties is able to make an unambiguous identification of interface regions or patches, a combination of some of them (via either a linear combination [35] or machine learning [36]) is found to be effective for improving the accuracy of binding-site prediction [37]. The PINUP method predicts interface residues using an empirical score function made of a linear combination of the energy score, interface propensity and residue conservation score [38].
Rossi et al. first construct a scoring function, and then perform a Monte Carlo optimization, to find a good scoring patch on the protein surface [39].
Machine Learning Methods are well suited to the classification of interface and non-interface surface residues [40,41]. Neural networks [42] and support vector machine [43,44] have been applied in this field. These studies take sequential or structural information as input [6]. Other researchers adopt two-stage model [23] to further improve the performance. Recently, the conditional random field (CRF) model has been introduced, which formalizes the prediction of protein interaction sites as a sequence-labeling task [45].
In this study, we revisit the difference of amino acid compositions between the interface area and other surface area. Although some researchers have found that there are different amino acid compositions among the interaction sites of different complexes (homo-permanent complexes, homo-transient complexes, hetero-permanent complexes, and hetero-transient complexes) [46], such difference has not been integrated into the prediction process. Here, the residue interface propensities of different complexes are collected. These propensities, combined with sequence profiles and accessible surface areas, are inputted to the support vector machine for the prediction of protein binding sites. Such propensities are further improved by taking evolutional information into consideration. The frequency profiles are directly calculated from the multiple sequence alignments outputted by PSI-BLAST [47] and converted into binary profiles [48] with a probability threshold. As a result, the protein sequences are represented as sequences of binary profiles rather than sequences of amino acids. Similar to the residue interface propensities, a class of novel propensities at the profile level is introduced. Binary profiles can be viewed as novel building blocks of proteins. It has been successfully applied in many computational biology tasks, such as domain boundary prediction [48], knowledge-based mean force potentials [49], protein remote homology detection [50] etc. Experimental results show that the binary profile propensities significantly improve the performance of binding sites prediction of proteins.

Residue interface propensities
Residue interface propensities are good indicators for binding sites and have been widely used in many studies [6]. The residue interface propensities of the four kinds of complexes are shown in Fig. 1. Positive propensity means that the residue is abundant in the interface while negative propensity means that the residue is abundant in the surface area.
The four kinds of complexes have similar residue interface propensities. They all show that hydrophobic residues (F, I, L, M, V) and some polar aromatic residues (W, Y, H) are favored in interface area. The charged residue R also shows preferences for the interface area. Other polar amino acid T, E and small amino acid P, A are disfavored in the interface. The same phenomena have been observed by others [35] although some researchers evaluated the ASA contribution for amino acid [3,38] while we count them. Biophysically similar residues, such as L and I, or D and E, usually showed similar trends, indicating the reliability of the data.
There are minor differences among the four kinds of complexes. Although many amino acids show the same trend for interface area or surface area, the propensities are different for the four kinds of complexes. Further more, some amino acids reveal different propensities in different complexes. Amino acid Q, S and T show preferences for the hetero complexes rather than homo complexes.
Amino acid C and L are favored in permanent complexes rather than transient complexes. Ofran and Rost [46] found that the composition of all interface types differed substantially from that of SWISS-PROT. Here we conclude that the residue interface propensities show general trends and have minor differences among different kinds of complexes.

Binary profile interface propensities
The binary profile frequencies in interface are different from those in surface area. These differences can be used to produce the discriminative binary profile propensities. In theory, the total number of binary profiles is extremely large (2 20 ), but in fact, only a small fraction of binary profiles appears, which is dependent on the choice of probability threshold P h and the dataset. Based on the results of cross-validation (Next section), the four kinds of complexes have different number of binary profiles, ranging from one hundred to several thousands. The binary pro-files and their propensities of the four kinds of complexes are listed in the Additional files (see additional file 1, 2, 3, 4). Note that the binary profiles with low occurrence times (<3) are ignored, since these profiles are not statistically significant and may introduce much noise.
An increased propensity of hydrophobic residues and their combinations in interface has been observed, such as the binary profile FHWY, ILMV. Although some amino acids are preferred in surface, the combination of these amino acids with other amino acids may be preferred in interface such as AEP, ST. Another special phenomenon is that some binary profiles only occur in interface while other binary profiles only occur in surface area. The former results in a maximum propensity (being set as 4) and the latter results in a minimum propensity (being set as -4). Each kind of complexes has many such binary profiles.
The differences of binary profile interface propensities among different complexes are significant in comparison with those of residue interface propensities. Many binary profiles show positive propensities in one complex but negative propensities in another complex. Table 1 summarizes the number of such binary profiles between any pair of complexes.

Comparative results with and without propensities
The first SVM takes profile and ASA of spatially neighboring residues as input, which are common input features used by previous studies [15,44,51]. Then we add the amino acid or binary profile interface propensities as an extra feature to evaluate whether these propensities can improve the performance or not. All the results are obtained by five-fold cross-validation.
The second SVM takes residue interface propensities as an extra feature. Table 2 gives the results with and without residue interface propensities. The similar performance indicates that the standard amino acid cannot provide efficient discrimination for the complicated interfaces of proteins. The results on homo-transient complex are extremely low because there are only 5 chains in this complex. The performance of the first SVM is comparable with those of Chung et al. [15]. They reported precision of 0.498 and recall of 0.568 with the same features on their 274 hetero-complexes.
The third SVM takes binary profile interface propensities as an extra feature instead of residue interface propensities. The probability threshold P h of converting a frequency profile into a binary profile needs to be optimized. During the validation process, three sets are used to train SVM, one validation set is used to optimize the parameter and the testing set is used to give the final results. That is, we select the values of P h that give the best results on the validation set and then such parameter is used to test the proteins on the testing set to give the final results. The influences of P h on the performance are illustrated in Fig.  2. F1 is used as the guild line since it is a tradeoff between precision and recall. The optimal values of P h are different for different complexes.
The results of cross-validation are then obtained with the optimal value of P h and shown in Table 3. The improvement of the third SVM is significant in comparison with the other two SVMs. The F1 is improved by about ten percent. According to the experimental results, we can infer that the propensities at the profile level may be more accurate than that at the amino acid level.

Comparative results with propensities from other complexes
Analysis of interface propensities shows that the residue interface propensities have minor differences among different complexes while the profile interface propensities differ significantly among different complexes. To validate Residue interface propensities of the four kinds of complexes  it, the propensities from other complexes are used as an extra feature. The results are shown in Table 4 (residuelevel) and Table 5 (profile-level).
The performances of Table 4 are close to those of Table 2, which indicates that the differences of residue interface propensities among different complexes can be negligible. The performances of Table 5 decrease significantly in comparison with those of Table 3, so the profile interface propensities are sensitive to the types of complexes. In other words, the propensities at the profile-level can give more exact description of interfaces than the propensities at the residue level.

Examples
Some examples are provided at Fig. 3. One protein is selected from each type of complexes. The true interface and the interface predicted by the second SVM and the The average F1 under different value of parameter Ph Figure 2 The average F1 under different value of parameter Ph. The F1 is obtained as the results of cross-validation at the validation dataset. third SVM are depicted. Most interface residues and noninterface residues can be predicted correctly. It is clearly that the classifier that integrates binary profile interface propensities is more accurate than the classifier that uses residue interface propensities.

Comparison with conservation scores
The conservation score is another widely used feature in prediction of function sites, which indicates the importance of a residue for maintaining the structure and function of a protein [18]. Here, we compare the binary profile interface propensities with conservation scores since both of them are derived from the multiple sequence alignment of homologues. Three conservation scores are investigated including the symbol entropy score [52], Karlin score [53] and Valdar score [54]. They are defined as follows: Please refer to [18] for detail calculation and comparison of these scores.
These conservation scores are used as an additional feature respectively and the cross-validation results are shown in Table 6. Overall the F1 is improved by about 2 percent in comparison with those without conservation scores (the first SVM).
All these conservation scores show positive correlation with binary profile interface propensities, although the Pearson correlation coefficients are small (0.017, 0.053, 0.064 for V entropy , V Karlin and V valdar respectively). The results show that the improvement by conservation scores is much lower than that by binary profile interface propensities.  Table 7. The classifiers with binary profile interface propensities outperform those with residue interface propensities by 5 percent in term of F1.

Independent testing
The results are better than those of related works. Liang et al [38]. developed an empirical scoring function for bind-  ing site prediction, which is a weighted combination of energy scores, conservation scores and residue interface propensities. They achieved the precision of 0.294 and the recall of 0.305. The overall F1 is only 0.30. Their method is trained on a small dataset (only 57 proteins). Furthermore their method is a simple combination of three features while our method is based on discriminative model.

Conclusion
In this study, the residue interface propensities of four kinds of complexes (hetero-permanent complex, heterotransient complex, homo-permanent complexes and homo-transient complex) are collected and applied in the process of predicting binding sites of proteins. Such propensities are improved by taking evolutionary information into consideration, which results in the binary profile interface propensities. Although there are minor differences among the four kinds of complexes, the residue interface propensities cannot provide efficient discrimination for the complicated interfaces of proteins. The binary profile interface propensities can significantly improve the performance of binding sites prediction of protein, which indicates that the propensities at the profile level are more accurate than those at the residue level.

Dataset
A comprehensive set of complexes is chosen from the Protein Data Bank (PDB) [55] and then subjected to a number of stringent filtering steps. All proteins with multi-chains, non-NMR structures and resolution better than 4 Å are selected. Two chains in a protein are considered as interacting pairs if at least two non-hydrogen atoms in each chain are separated by no more than 5 Å [42,56].
For PDB structure with more than two chains, each chain is selected for at most one time. For protein chain that interacts with multiple partners, only one partner with the most interfacial residues is selected as its partner. The protein chains with less than 40 amino acids are removed. The PQS web-server [57] is used to eliminate crystal packing complexes rather than biologically functional multimers. The selected chains are further filtered such that no pair of chain has more than 25% sequence identify. Finally, a total of 1139 chains are obtained.

Classification of complexes
The protein-protein interactions can be divided into different types according to different criterions [58]. In this study, the complexes are classified by the homology of interacting chains (homo versus hetero) and the lifetime of the complexes (transient versus permanent).
Using simple sequence comparisons, the complexes can be classified as homo-complexes or hetero-complexes. Two interacting protein chains were defined as homocomplex if over 90% of them are aligned and the sequence   identity over the aligned region is more than 95% [42]. All other complexes are classified as hetero-complexes.
A permanent complex is usually very stable and thus only exists in its complexed form. In contrast, a transient complex can exist in separated state. The method of differentiating the transient complexes and permanent complexes is same as the one used by Ofran and Rost [46]. The guild lines for classifying the hetero-complexes and homo-complexes into permanent and transient states are different. They are briefly described here. If the chains from the hetero-complexes are stored in the same SWISS-PROT files [59], the complexes are classified as hetero-permanent complexes; otherwise they are classified as hetero-transient complexes. All homo-complexes that are annotated as monomers in DIP [60] database are classified as homotransient complexes; otherwise they are classified as homo-permanent complexes.
The above dataset is then grouped into four kinds of complexes (hetero-permanent, hetero-transient, homo-permanent, homo-transient). The statistical information of different complexes is tabulated in Table 8. An amino acid is defined as a surface amino acid if the ASA of at least one of its atom is larger than 2 Å 2 [39]. A surface residue is considered as interface residues if its accessible surface area is decreased by more than 1 Å 2 upon complexation [38]. The ASA is calculated with the DSSP program [61]. According to this definition, 27.3% of the surface residues are interface residues. Such ratio is very close to that (28%) in Chung's dataset [15].

Calculation of propensities
The amino acid frequencies between interface and other surface area are different. Such difference can be used to produce the residue interface propensity, which is defined as the log ratio between the amino acid frequency in interface area and that in surface area: P a = In(P a, I /P a, S ) ( 4 ) where P a is the propensity of amino acid a, P a, I is the frequency of amino acid a in interface area and P a, S is the frequency of amino acid a in surface area. The frequencies can be calculated from the training set by maximum likelihood estimation: where C a, I is the count of amino acid a in interface area, C I is the total number of amino acid in interface area, C a, S is the count of amino acid a in surface area, C S is the total number of amino acid in surface area. The residue interface propensity describes the likelihood of amino acid to be found in interface area as compared to those in surface area. A propensity of 0 indicates that the amino acid has the same frequency in interface and surface area. A positive propensity means that the amino acid is over-representative in interface area.   In term of binary profile, the protein sequence is represented as sequence of binary profiles rather than sequence of amino acids. Each amino acid is replaced by the corresponding binary profiles that are derived from the multiple sequence alignments as described in the following section. The calculation formula of binary profile interface propensities are same as that of the residue interface propensities except that the subscripts are replaced by binary profiles rather than amino acid: where P b is the propensity of binary profile b, P b, I is the frequency of binary profile b in interface area and P b, S is the frequency of binary profile b in surface area. The frequencies can also be calculated by maximum likelihood estimation in the same manner of amino acid. The binary profile interface propensity contains evolution information and provides more accurate prediction of binding sites than amino acid interface propensity according to the experimental results.
Here an example of calculating the propensities of binary profiles is provided. Suppose there is a frequency profile (see Table 9): When the probability threshold P h is taken as 0.08, we get the following binary profile (see Table 10): By collecting the non-zero term in binary profile, the combination of amino acid AGLN is obtained. Suppose the frequency of AGLN is 0.00042 in interface area and 0.00021 in surface area, which are calculated by maximum likelihood estimate using equation (5) and (6). Thus, the propensity of AGLN is 0.693147 (ln (0.00042/ 0.00021)) by equation (7).

Generating of binary profiles
A binary profile can be expressed by a vector with dimensions of 20, in which each element represents one kind of amino acid and can only take value of 0 or 1. When the element takes value of 1, it means that the corresponding amino acid can occur during evolution. Otherwise, it means that the corresponding amino acid cannot occur. A binary profile can also be expressed by a substring of amino acid combination, which is obtained by collecting each element of the vector with non-zero value. Each combination of the twenty amino acids corresponds to a binary profile and vice versa. Below we describe the process of generating the binary profiles.
The PSI-BLAST [47] is used to generate the profiles of amino acid sequences with parameters j = 3 and e = 0.001. The search is performed against the non-redundant database (NR) database from NCBI. The frequency profiles are directly obtained from the multiple sequence alignments outputted by PSI-BLAST. The target frequency reflects the probability of an amino acid occurrence in a given position of the sequences. The method of target frequency calculation is similar to that implemented in PSI-BLAST.
Because the frequency profile is a matrix of frequencies for all amino acids, it cannot be directly used and need to be converted into a binary profile by a probability threshold P h . When the frequency of an amino acid is larger than P h , it is converted into an integral value of 1, which means that the specified amino acid can occur in a given position of the protein sequence during evolution. Otherwise it is converted into 0. A substring of amino acid combination is then obtained by collecting the binary profile with nonzero value for each position of the protein sequences. These substrings have approximately represented the amino acids that possibly occur at a given sequence position during evolution. Fig. 4 has shown the process of generating binary profiles.

Prediction
Support Vector Machine (SVM) is a class of supervised learning algorithms first introduced by Vapnik [62]. Given a set of labelled training vectors (positive and negative input examples), SVM can learn a linear decision boundary to discriminate between the two classes. The result is a linear classification rule that can be used to classify new test examples. SVM has exhibited excellent performance in practice and has strong theoretical foundation of statistical learning theory. Here the LIBSVM package [63] is used as the SVM implementation with radial basis function as kernel. The values of γ and regularization parameter C are set to be 0.005 and 1, respectively.  The input of SVM is a window containing a surface residue and its 12 spatially nearest surface residues [15]. An interface residue is defined as the positive sample, and a surface residue is defined as the negative sample. The input features are sequence profiles, accessible surface areas and propensities of residues in the window. The sequence profiles are taken from the Position-Specific Score Matrix (PSSM) outputted by PSI-BLAST [47]. All the input values are scaled between -1 and 1 before being inputted to the SVM.
It is known that SVM cannot perform well on an unbalanced dataset. In this dataset, only 27.3% of the surface residues are interface residues. If all surface residues are used in the training, the classifier will be biased to predict a residue as a surface residue. To address this issue, a set of surface residues is randomly selected to make the ratio of positive and negative data 1:1. Fivefold cross-validation is then used to evaluate the SVM. The whole dataset is ran-domly divided into five subgroups with an approximately equal number of chains. Each SVM runs five times with five different training and test sets. For each run, three of the subsets are used as the training set, one subset is used to select the optimal parameters and the remaining one is used as the test set.

Performance metrics
The following measures are used to evaluate the performances: precision, recall, accuracy, F1 and correlation coefficient (CC), which are defined as follows: Precision TP TP FP = + (8) Recall TP TP FN = + The flowchart of generating binary profiles Figure 4 The flowchart of generating binary profiles. The multiple sequence alignment is obtained by PSI-BLAST. The frequency profile is calculated from the multiple sequence alignment and converted to a binary profile with a frequency threshold. The substring of amino acid combination is then collected.
where TP is the number of true positives (interface residues correctly classified as interface residues), FP is the number of false positives (surface residues incorrectly classified as interface residues), TN is the number of true negatives (surface residues correctly classified as surface residues) and FN is the number of false negatives (interface residues incorrectly classified as surface residues).
Precision, recall and F1 are used to measure the performance of classifying interface residues, while accuracy is used to measure the performance of classifying the whole test dataset. Correlation coefficient (CC) is applied to measure the correlation between predictions and actual test data.