Computing the protein binding sites

Background Identifying the location of binding sites on proteins is of fundamental importance for a wide range of applications including molecular docking, de novo drug design, structure identification and comparison of functional sites. Structural genomic projects are beginning to produce protein structures with unknown functions. Therefore, efficient methods are required if all these structures are to be properly annotated. Lots of methods for finding binding sites involve 3D structure comparison. Here we design a method to find protein binding sites by direct comparison of protein 3D structures. Results We have developed an efficient heuristic approach for finding similar binding sites from the surface of given proteins. Our approach consists of three steps: local sequence alignment, protein surface detection, and 3D structures comparison. We implement the algorithm and produce a software package that works well in practice. When comparing a complete protein with all complete protein structures in the PDB database, experiments show that the average recall value of our approach is 82% and the average precision value of our approach is also significantly better than the existing approaches. Conclusions Our program has much higher recall values than those existing programs. Experiments show that all the existing approaches have recall values less than 50%. This implies that more than 50% of real binding sites cannot be reported by those existing approaches. The software package is available at http://sites.google.com/site/guofeics/bsfinder.


Background
Identifying the location of binding sites on proteins is of fundamental importance for a wide range of applications including molecular docking, de novo drug design, structure identification and comparison of functional sites. Structural genomic projects are beginning to produce protein structures with unknown functions. Therefore, efficient methods are required if all these structures are to be properly annotated.
Many methods have been proposed for identifying the location of binding sites on proteins. Laurie and Jackson give an energy-based method for the prediction of protein-ligand binding sites [1]. Bradford and Westhead combine a support vector machine (SVM) approach with surface patch analysis to predict protein-protein binding sites [2]. Chen et al. develop a tool, 3D-partner, for inferring interacting partners and binding models [3]. 3D-partner first utilizes IMPALA to identify homologous structures (templates) of a query protein sequence from heterodimer profile library. The sequence profiles of those templates are then used to search interacting candidates of the query from protein sequence databases by PSI-BLAST. Lo et al. develop a method for predicting helix-helix interaction from residue contacts in membrane proteins [4]. They first predict contact residues from sequences. Their relationships are further predicted in the second step via statistical analysis on contact propensities and sequence and structural information. Li et al. propose an approach for finding binding sites for groups of proteins [5]. It contains the following steps: finding protein groups as bicliques of protein-protein interaction networks (PPI), identifying conserved motifs, and searching domain-domain interaction databases. Liu et al. extend the method of Li et al. in [5] and consider comparing 3D local structures [6]. Guo and Wang identify the binding sites by finding two similar 3D substructures [7].
SiteEngine is a method that recognizes the regions on the surface of one protein that are similar to the binding sites of another. It uses geometric hashing triangles to transfer the input sites into the recognized region [8]. SuMo is a system for finding similarities in arbitrary 3D structures or substructures of proteins. It is based on a unique representation of macromolecules using selected triples of chemical groups [9]. The web server pdbFun analyzes the structure and function of proteins at the residue level [10]. When comparing a complete protein with all complete protein structures in the PDB database, experiments show that all the existing approaches have recall values less than 50% implying that more than 50% of real binding sites cannot be reported by those existing approaches.
In this paper, we design a method to recognize regions of binding sites on the proteins. It consists of three steps: local sequence alignment, protein surface detection, and 3D structures comparison. Experiments show that the average recall value of our approach is 82% and the average precision value of our approach is also significantly better than the existing approaches.

Methods
Given two complete protein structures, our task is to find the binding sites on the given proteins. Our method contains three steps.
Step 1, we do local sequence alignment at the atom level to get the alignments of conserved regions. These alignments of conserved regions may contain some gaps. Step 2, among the conserved regions obtained in Step 1, we use the 3D structure information to identify the surface segments. Step 3, for any pair of the surface segments identified in Step 2, we compute a rigid transformation to compare the similarity of the substructures in 3D space and output the qualified pairs as binding sites.
Step 1: Local sequence alignment In PDB format files, each residue (amino acid) is represented in the traditional order of atom records N, CA, C, O, followed by the side chain atoms (CB, CG1, CG2 ...) in order first of increasing remoteness, and then branch. The whole protein sequence of residues can be translated into a sequence of atoms based on this representation. The sequences of binding sites on the proteins are usually conserved at the atom level. When looking at the SitesBase [11], we can see that the pair of binding sites form a conserved region that are well aligned at the atom level, where atoms of the same types are matched and all the unmatched atoms correspond to gaps. Figure 1 is the result of SitesBase for proteins 1TU4D and 5P21A.
We use the standard Smith-Waterman's local alignment algorithm [12] to find the conserved segments, where a matched pair of atoms of the same type has a score 1, a mismatched pair of atoms of different types has a score -∞, a mismatch between an atom and a space has a score -2. The local alignment algorithm can return a set of conserved segments in the alignment of the protein sequences of atoms.
We have done many experiments and found that the set of conserved segments output by the local sequence alignment algorithm always contains the pairs of binding sites in the SitesBase. The only problem is that the local sequence alignment algorithm outputs too many matched atoms. Next, we will further reduce the matched atoms. After obtaining the set of conserved segments from the local sequence alignment, we focus on the columns with identical pairs of atoms and ignore the rest of columns in the following steps.
Step 2: Identifying surface segments Inspired by the work in [13], we propose the following method to find surface segment of proteins. First, the protein is projected onto 3D grid in the Euclidean space. For the grid, we use a step size of 1Å. Second, grid points are marked as interior, surface or empty. A grid point is marked as protein if the point is within 2Å distance of an atom in the protein. A grid point is marked as empty if it is not protein point. A grid point is marked as interior if all its six neighbor grid points are protein points. A grid point is marked as surface if at least one of its six neighbor grid points is not protein point. An atom in the protein is a surface atom if it is within distance 1.5Å of a surface point. Figure 2 gives an example, where the dark grid points are surface points.
For a conserved segment output by the local sequence alignment algorithm, we consider all its subsegments containing at least 15 matched pairs of atoms. For such a subsegment, if both sequences on this subsegment have at least 2/3 atoms as the surface atoms, we treat such a subsegment as a candidate binding site for further processing in the next step.
Step 3: Computing rigid transformations to match candidate binding sites For any candidate binding sites obtained from Step 2, we will further test if the pair of 3D substructures can match well on such a site. Precisely, we can find the set of subsegments in a given segment with alignment A using the following rule: there exists a rigid transformation such that the distance between each pair of atoms in the same column of the subsegment is at most d, where d is a parameter given by the user. A rigid transformation is a transformation for protein 3D structure in the 3D space that preserves distances between any pair of points in the structure of protein. This requires us to solve the following protein 3D structure matching problem: Input: A segment with sequence alignment A of two proteins, where each position in the alignment has two identical atoms, the 3D coordinate of each atom in the alignment, and a threshold d.
Goal: Find a set of subsegments with alignment A such that for each output subsegment the Euclidean distance between each pair of atoms in the same column is at most d.
The protein 3D structure matching problem can be solved in several ways. Here we use the method in [14] which is a faster version of the method in [15] to solve the problem. The method in [14] can compute a rigid transformation such that the distance between each matched pair of atoms is at most (1+Î)d, where Î = 0.1 is a parameter to control the precision of the transformation. This is just an approximate rigid transformation, and it is good enough in practice.

Testing the overlap of the proteins in 3D space
When computing the rigid transformation, we also require that the proteins do not overlap under the transformation. For each rigid transformation that can match the substructures of the candidate subsegment, we test if the proteins have overlap in 3D space under such a transformation as follows: 1. Construct the grid in 3D space and mark each grid point as interior, surface or empty as in Step 2 with respect to each of the given proteins.
2. Let X be the number of grid points that are interior points for both proteins, X 1 and X 2 be the number of interior points of the first protein and the second protein, respectively. If X ≤ 0.05 × min{X 1 , X 2 }, then we say that there is no overlap between proteins under the current rigid transformation and we output the matched substructures as the predicted binding sites.

Comparison with existing methods
In this section, we compare our program BsFinder with three existing programs SiteEngine, SuMo, and pdbFun. They use different methods to predict the binding sites of given proteins. SiteEngine [16] is a method that recognizes the regions on the surface of one protein that are similar to the binding sites of another, and geometric hashing triangles are used to transfer the input sites into the recognized region [8]. SuMo [17] is a system for finding binding sites onto query structures, by comparing the structure of triplets of chemical groups against the binding sites found in PDB database [9]. The web server pdbFun [18] locates binding sites in proteins at the residue level, and it analyzes structural similarity between any pair of residue selections [10].
To compare BsFinder with the three existing systems, we use the proteins in PDB database, and select 55 proteins to compare with the whole database. Note that the Structural Classification of Proteins (SCOP) database [19] in [20] aims to provide a detailed and comprehensive description of the structural and evolutionary relationships between all proteins whose structures are known. It provides 11 classes to separate all known protein folds. Each class contains several different families. We choose 5 proteins from each class in different families such that there is only one entry from each family. Since BsFinder allows users to give the value of d, we set the threshold d = 1.5Å and output the matched sites with at least 15 atoms.

Evaluation of prediction
To calculate the precision and recall value for each approach, we need to know which pair of binding sites output by the programs is real. Here we look at SitesBase [21] in [11], which holds the set of known binding sites found in PDB. The precision value is defined as the number of sites output by the program that are confirmed in SitesBase divided by the total number of sites output by the program, where a output site is confirmed in SitesBase if at least two residues of the output sites are the same as the binding sites in SitesBase. As the sites output by SuMo are very short, the sites output by SuMo are confirmed if each one has at least one residue which is identical to that in Sites-Base. Ideally, all the sites output by the program are confirmed in SitesBase, in the case, the precision value is 100%. Apparently, the larger the precision value is, the better the program is. The recall value is defined as the number of sites output by the program that are confirmed in SitesBase divided by the total number of binding sites more than two complete residues for given proteins in SitesBase. If all the binding sites for given proteins in the SitesBase can be output by the program, then the recall value is 100%. Again, the larger the recall value is, the better the program is.
We use the 55 selected proteins to compare with the whole PDB database. The results are shown in Table 1. The average numbers of the sites output by BsFinder, SiteEngine, SuMo, and pdbFun are 6425, 6003, 6329, and 1936, respectively. On average, pdbFun reports the smallest number of sites and the other three systems output approximately the same number of sites. The average numbers of the confirmed sites output by BsFinder, SiteEngine, SuMo, and pdbFun are 2218, 1265, 674, and 281, respectively. See Figure 3(a).
The precision and recall values for 55 proteins output by four programs are shown in Table 1. Apparently, BsFinder has the largest precision and recall values for most of the cases. On average, the precision value of BsFinder is 34% while the precision values for SiteEngine, SuMo, and pdbFun are 21%, 11%, and 15%, respectively. The average recall value of BsFinder is 82% while the average recall values for SiteEngine, SuMo, and pdbFun are 47%, 25%, and 11%, respectively. See Figure 3(b). The value of recall is very important in practice. From the experiment results, we know that the existing programs have lower values of recall.
The possible reasons that our method can get better results might be (1) we use the surface information, (2) we look at the similarity of two local 3D substructures in terms of rigid transformation while the previous methods use triples of atoms or pairs of amino acids and (3) the volumes of the protein molecules are considered when the rigid transformation is computed.

Comparison of running time
To compare the running time of different programs, we use a Pentium(R) 4 (CPU of 2.40 GHz) to run all four programs. Based on 55 selected proteins, the average running times of BsFinder, SiteEngine, SuMo, and pdbFun for comparing each given protein with the whole PDB database are roughly 50 minutes, 70 minutes, 30 minutes, and 5 minutes, respectively. See Table 2. Thus, BsFinder is the second slowest program. However, it is still faster than SiteEngine which has the highest average values of precision and recall among the three existing programs.

Performance of programs for different families
To see the performance of programs for different protein families, we look at three different families (G proteins family in P-loop folds, PYP-like family in Profilin-like folds, and FAD-linked reductases family in FAD/NAD (P)-binding folds) and select five proteins from each family. The average numbers of matched sites output by BsFinder for three families are 7680, 5289, and 7892, respectively. The average numbers of confirmed sites for three families are 3487, 1132, and 4138, respectively. The average precision values for three families are 45%, 21% and 53%, respectively. The average recall values for three families are 94%, 60% and 96%, respectively. The results are shown in Figure 4.

G proteins family in P-loop folds
We select 5 proteins (1A2B, 1CXZ, 1DPF, 1FTN, 1S1C) from G proteins family in P-loop folds. The results are shown in Table 3. The precision values of BsFinder (48%, 46%, 43%, 42% and 47%) are larger than those of other three programs. The recall values of BsFinder (95%, 93%, 92%, 91% and 99%) are more than 90%, while the recall values of the other three programs are almost less than 40%.
A Case: We compare two proteins 4VHBA and 1CQXA. The cartoon version of the protein 3D structures are shown in Figure 5, and the matched parts of structures are shown as the sticks fashion. BsFinder finds a rigid transformation that matches residues 84-86 from 4VHBA to residues 84-86 from 1CQXA, residues 95-100 from 4VHBA to residues 95-100 from 1CQXA, and residues 125-128 from 4VHBA to residues 125-128 from 1CQXA. See Figure 6. The three pairs of matched sites are confirmed in SitesBase. Note that these three pairs can be matched under one rigid transformation simultaneously.

Searching similar binding sites
BsFinder can use a binding site to search the similar sites in the protein structures database. SiteEngine can search a given functional site on a large set of complete protein structures. SuMo can search for the given 3D site of interest among the structures of the PDB. PAST [22] is a web service based on an adaptation of the generalized suffix tree and relies on a linear representation of the protein backbone [23]. PAST can find the functional sites from the protein structures database similar to the given binding site. We randomly select the 100 binding sites with different types from the SitesBase and search the whole PDB database. The average numbers of the sites output by BsFinder, SiteEngine, SuMo, and PAST are 274, 266, 399, and 281, respectively. The average numbers of the confirmed sites output by BsFinder, SiteEngine, SuMo, and PAST are 106, 73, 72, and 58, respectively. See Figure 7(a). BsFinder finds a relatively smallest number of output sites, and the number of confirmed sites output by BsFinder is the biggest. Apparently, BsFinder has the largest precision and recall values for most of the cases. On average, the precision value of BsFinder is 39% while the precision values for SiteEngine, SuMo, and PAST are 27%, 22%, and 24%, respectively. The average recall value of BsFinder is 86% while the average recall values for SiteEngine, SuMo, and PAST are 58%, 51%, and 45%, respectively. See Figure 7(b).

The gaps in binding sites
In the first step of our algorithm, we do sequence alignment where each letter is an atom. This allows the matched sites to have some missed atoms, and each missed atom represents one gap in the binding sites.
Step 1 is very important for predicting binding sites on proteins. Among the output sites, 67127 of them do not contain any gap, 63593 contain one gap, 77725 contain two gaps, 81259 contain three gaps, 38863 contain four gaps, 21198 contain five gaps and 3533 contain more than five gaps. The gap distribution of the confirmed The first number is the number of output sites reported by the program, the second number is the number of confirmed sites reported by the program; ‡ The first number is the precision value (%) for the program, the second number is the recall value (%) for the program;    The first number is the number of output sites reported by the program, the second number is the number of confirmed sites reported by the program; ‡ The first number is the precision value (%) for the program, the second number is the recall value (%) for the program; The first number is the precision value (%) for the program, the second number is the recall value (%) for the program; The first number is the number of output sites reported by the program, the second number is the number of confirmed sites reported by the program; ‡ The first number is the precision value (%) for the program, the second number is the recall value (%) for the program; sites are 18285 (no gap), 19504 (one gap), 26809 (two gaps), 26809 (three gaps), 15847 (four gaps), 12197 (five gaps) and 2452 (more than five gaps). The confirmed sites have higher proportion of the four or more gaps among all output sites reported by BsFinder.

The power of surface detection
In Step 2 of our algorithm, we identify the surface atoms in the given proteins and rule out the substructures in which less than 2/3 of atoms are the surface atoms for further calculation of the rigid transformation. To demonstrate the effect of Step 2, we compare the final version of BsFinder with the version without Step 2. By adjusting the parameters, the final version of BsFinder has improved precision value while the recall value remains essentially unchanged. The average precision values for BsFinder without Step 2 and the final version of BsFinder are 29% and 34%, respectively. The average recall values for BsFinder without Step 2 and the final version of BsFinder are 83% and 82%, respectively. Therefore, by doing Step 2 the precision value can be improved by about 5%. This is a significant improvement.

Conclusions
We have developed a program for finding binding sites on the given proteins. Our method uses the 3D structure information to detect the similar surface regions. Experiments show that our program outperforms all existing programs.