Constructing patch-based ligand-binding pocket database for predicting function of proteins

Background: Many solved tertiary structures of proteins of unknown function have no global sequence or structural similarity to proteins of known function. Functional clues for such proteins can often be obtained by predicting the small ligand molecules that bind to them.
Methods: In our previous work, we developed an alignment-free, local surface-based pocket comparison method named Patch-Surfer, which predicts ligand molecules that are likely to bind to a protein of interest. Given a query pocket in a protein, Patch-Surfer searches a database of known pockets and finds those similar to the query. Here, we have extended the database of ligand binding pockets for Patch-Surfer to cover diverse types of binding ligands.
Results and conclusion: We selected 9,393 representative pockets with 2,707 different ligand types from the Protein Data Bank. We tested Patch-Surfer on the extended pocket database by predicting the binding ligands of 75 non-homologous proteins, each of which binds one of seven different ligands. Patch-Surfer achieved an average enrichment factor of over 20.0 at 0.1 percent. The results did not depend on the sequence similarity of the query protein to proteins in the database, indicating that Patch-Surfer can identify correct pockets even in the absence of known homologous structures in the database.


Background
An increasing number of structures of uncharacterized proteins have been solved by structural genomics projects. As of June 2011, there are 3,321 structures of unknown function in the Protein Data Bank (PDB). Elucidating the function of these proteins is an important task for bioinformatics. To predict protein function from structure, we have recently developed an alignment-free local pocket surface comparison method for predicting the type of ligand that is likely to bind to a query protein [1]. The algorithm, named Patch-Surfer, represents a binding pocket as a combination of segmented surface patches, each of which is characterized by its shape, electrostatic potential, hydrophobicity, and concaveness. A query pocket, represented as a group of patches, is compared with a database of pockets with known binding ligands, and the binding ligand is predicted by summarizing the similar pockets retrieved from the database. Representing a pocket by a set of patches was shown to be effective in tolerating differences in global pocket shape while capturing local similarity between pockets. The shape and the physicochemical properties of surface patches are represented using the 3D Zernike descriptor (3DZD), a series expansion of a mathematical 3D function. In this work, we constructed a large database of ligand binding pockets that contains a diverse set of pockets. We evaluated the performance of Patch-Surfer on this database in terms of the enrichment factor of correct ligand binding pockets retrieved for query pockets.

Methods
The Patch-Surfer method for binding ligand prediction
Here we briefly describe the Patch-Surfer algorithm; please refer to the original paper for more details [1]. Given a query protein structure, the surface is computed and a pocket region is extracted. If the binding pocket of the protein is not known, it can be predicted using a protein pocket detection algorithm such as Visgrid [2]. The pocket is segmented into surface patches, and four features of each patch (the geometrical shape, the surface electrostatic potential, the hydrophobicity, and the concaveness [2]) are encoded with the 3DZD for efficient storage and comparison. Thus, a pocket is represented by a set of surface patches [1,3].
The 3DZD is a series expansion of a 3D function, which allows a compact and rotationally invariant representation of a 3D object (i.e. a 3D function) [4]. To compute the 3DZD for a patch, the patch is mapped onto a 3D grid, and grid points that overlap with the patch are marked either with 1 (to indicate the geometrical shape) or with the value of the physicochemical property being represented. The assigned values in the 3D grid are considered as a 3D function, f(x), which is expanded into a series in terms of the Zernike-Canterakis basis, defined as follows:

$$Z_{nl}^{m}(r, \vartheta, \varphi) = R_{nl}(r)\, Y_{l}^{m}(\vartheta, \varphi) \tag{1}$$

where $-l \le m \le l$, $0 \le l \le n$, and $(n - l)$ is even. $Y_{l}^{m}(\vartheta, \varphi)$ are the spherical harmonics and $R_{nl}(r)$ is the radial function, constructed in such a way that $Z_{nl}^{m}(r, \vartheta, \varphi)$ can be converted to polynomials in the Cartesian coordinates, $Z_{nl}^{m}(\mathbf{x})$. To obtain the 3DZD of f(x), the 3D Zernike moments are first computed:

$$\Omega_{nl}^{m} = \frac{3}{4\pi} \int_{|\mathbf{x}| \le 1} f(\mathbf{x})\, \overline{Z_{nl}^{m}(\mathbf{x})}\, d\mathbf{x} \tag{2}$$

Then the 3DZD, $F_{nl}$, is computed as the norms of the vectors $\Omega_{nl}$. The norm gives rotational invariance to the descriptor:

$$F_{nl} = \lVert \Omega_{nl} \rVert = \sqrt{\sum_{m=-l}^{l} \left| \Omega_{nl}^{m} \right|^{2}} \tag{3}$$

The index n defines the range of l, and a 3DZD is a series of invariants (Eqn. 3), one for each pair of n and l, where n ranges from 0 to the specified order. We use order n = 15 (72 invariants) in the local surface patch comparison. The shape and the concaveness are each represented by a vector of 72 invariant values, while the vectors for the electrostatic potential and the hydrophobicity have 144 invariants.
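The invariant count quoted above can be checked directly from the index constraints. The following is a minimal sketch, not the authors' implementation: the function names are hypothetical, and the moments passed to `invariants_from_moments` are assumed to have been computed elsewhere.

```python
import math


def zernike_index_pairs(order):
    """List all (n, l) pairs with 0 <= l <= n <= order and (n - l) even."""
    return [
        (n, l)
        for n in range(order + 1)
        for l in range(n + 1)
        if (n - l) % 2 == 0
    ]


def invariants_from_moments(moments):
    """Collapse moments {(n, l): [Omega_nl^m for m = -l..l]} into norms F_nl.

    Taking the norm over m is what makes each F_nl rotationally invariant.
    """
    return {
        key: math.sqrt(sum(abs(c) ** 2 for c in coeffs))
        for key, coeffs in moments.items()
    }


if __name__ == "__main__":
    # Order 15 yields 72 (n, l) pairs, hence 72 invariants per descriptor.
    print(len(zernike_index_pairs(15)))
```

Each value of n contributes floor(n/2) + 1 admissible values of l, so summing over n = 0 to 15 gives the 72 invariants used for a patch.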
Next, the query pocket is compared to known pockets stored in the database. In the database, each pocket is also represented as a set of surface patches. For example, ATP binding pockets are represented with 29.5 patches on average. Given the query pocket and a pocket in the database, the pocket comparison process first identifies similar patches between the two pockets using a modified bipartite matching algorithm. Two options were tested for the matching stage: the first matches all patches, while the second matches only pairs of patches whose distance is smaller than a predefined threshold. The similarity of the two pockets is then measured with a linear combination of scoring terms computed over the matched patches.
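The two matching options can be illustrated with a small sketch. Note that this uses a simple greedy one-to-one matching as a stand-in; it is not the modified bipartite matching algorithm of Patch-Surfer, and the function name, the distance matrix, and the example values are all hypothetical.

```python
def match_patches(dist, threshold=None):
    """Greedily pair query patches (rows) with database patches (columns).

    dist[i][j] is the distance between query patch i and database patch j.
    With threshold=None every patch of the smaller pocket gets a partner;
    otherwise only pairs with distance below the threshold are matched.
    """
    # Visit candidate pairs from the most similar to the least similar.
    candidates = sorted(
        (d, i, j) for i, row in enumerate(dist) for j, d in enumerate(row)
    )
    used_rows, used_cols, matches = set(), set(), []
    for d, i, j in candidates:
        if threshold is not None and d >= threshold:
            break  # all remaining pairs are at least as distant
        if i in used_rows or j in used_cols:
            continue  # each patch may be matched at most once
        used_rows.add(i)
        used_cols.add(j)
        matches.append((i, j, d))
    return matches


if __name__ == "__main__":
    toy = [[0.10, 0.50, 0.40],
           [0.35, 0.15, 0.60]]
    print(match_patches(toy))                  # no threshold
    print(match_patches(toy, threshold=0.12))  # strict threshold
```

A greedy pass does not guarantee the minimum total distance that an optimal assignment would give; it is used here only to keep the example short and dependency-free.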

Constructing database of representative ligand binding pockets
Representative pockets were selected as follows. A list of 5,438 non-redundant protein structures complexed with ligand molecules, extracted from the PDB, was obtained from the Protein-Small-Molecule DataBase (http://compbio.cs.toronto.edu/psmdb/downloads/CPLX_25_0.85_7HA.list) [5]. From this list, we first removed all ligands that consist of fewer than 7 heavy atoms. Then, two ligands that bind to the same protein were grouped together if any pair of atoms, one from each ligand, is closer than 4.0 Å. We further filtered out ligands that are closer than 1.4 Å to the protein, because they bind covalently to the protein. Ligand molecules whose closest protein heavy atom is more than 3.5 Å away were also removed, as they are not physically interacting with the protein. Finally, we obtained 9,393 pocket structures that bind 2,707 different types of ligand molecules.
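The size and distance filters above can be summarized in a short sketch. It assumes that the number of heavy atoms and the minimum ligand-protein heavy-atom distance have already been computed for each ligand; the `Ligand` class, its field names, and the constant names are hypothetical, and the 4.0 Å grouping step is not covered.

```python
from dataclasses import dataclass

MIN_HEAVY_ATOMS = 7    # ligands with fewer heavy atoms are discarded
COVALENT_CUTOFF = 1.4  # Å; closer contacts are treated as covalent
CONTACT_CUTOFF = 3.5   # Å; farther ligands are treated as non-interacting


@dataclass
class Ligand:
    name: str
    heavy_atoms: int
    min_dist_to_protein: float  # Å, closest ligand-protein heavy-atom pair


def keep_ligand(lig):
    """Return True if the ligand passes all three filters in the text."""
    if lig.heavy_atoms < MIN_HEAVY_ATOMS:
        return False
    if lig.min_dist_to_protein < COVALENT_CUTOFF:
        return False
    if lig.min_dist_to_protein > CONTACT_CUTOFF:
        return False
    return True


if __name__ == "__main__":
    candidates = [
        Ligand("ok", 10, 2.8),
        Ligand("too_small", 5, 2.8),
        Ligand("covalent", 12, 1.2),
        Ligand("too_far", 12, 4.0),
    ]
    print([lig.name for lig in candidates if keep_ligand(lig)])
```

Interpreting the 3.5 Å rule as a test on the minimum ligand-protein distance is an assumption about the intended meaning of the filter.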

Obtaining weighting factors for scoring function
The distance between patch A in the query pocket and patch B in a pocket in the database is defined as:

$$d(A, B) = \sum_{t} w_{t}^{B}\, L2\left(A_{t}, B_{t}\right) \tag{4}$$

where L2 is the L2 norm (the Euclidean distance) between the 3DZDs of patches A and B in terms of the surface property t, which is either the geometrical shape, the hydrophobicity, the surface electrostatic potential, or the concaveness [2] of the patch. $w_{t}^{B}$ is the weighting factor for the property t, which depends on the patch B from the database. These weights were computed for each patch of each ligand type using the average (avg) and the standard deviation (std) of the Euclidean distances between patches at the equivalent position (i.e. patches whose closest ligand atom is the same) in pockets in the database that bind the same ligand. The weight of a patch P for each surface property t ∈ {shape, hyd, ele, con} is defined in terms of avg and std, which are used to normalize the differences in the distributions of the four properties.
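As an illustration of how such a patch distance could be evaluated, the sketch below assumes that the distance is a weighted sum of the per-property Euclidean distances between 3DZD vectors. The function names and the dictionary layout are hypothetical, and the weights are taken as given inputs rather than derived from avg and std.

```python
import math

PROPERTIES = ("shape", "hyd", "ele", "con")


def l2(u, v):
    """Euclidean distance between two descriptor vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def patch_distance(query_patch, db_patch, db_weights):
    """Weighted sum of per-property L2 distances (assumed form of the score).

    query_patch and db_patch map each property name to its 3DZD vector;
    db_weights maps each property name to the weight of the database patch.
    """
    return sum(
        db_weights[t] * l2(query_patch[t], db_patch[t]) for t in PROPERTIES
    )


if __name__ == "__main__":
    a = {t: [0.0, 0.0] for t in PROPERTIES}
    b = {t: [3.0, 4.0] for t in PROPERTIES}
    w = {t: 0.5 for t in PROPERTIES}
    print(patch_distance(a, b, w))
```

Because the weights belong to the database patch, the score is not symmetric in general: swapping query and database patches can change the result.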

Enrichment factor
We used the enrichment factor to evaluate how well Patch-Surfer retrieves binding pockets of the same binding ligand for query pockets. The enrichment factor (EF) is the fraction of correct pockets among those retrieved in the top x percent of the ranked database, relative to the fraction of such pockets in the entire database:

$$EF_{x} = \frac{N_{x}^{P} / N_{x}}{T^{P} / T^{DB}}$$

where $T^{P}$ is the total number of pockets that bind the ligand type P in the database of size $T^{DB}$, $N_{x}^{P}$ is the number of pockets for the ligand type P ranked within the top x percent by the database search method (Patch-Surfer), and $N_{x}$ is the total number of retrieved pockets ranked in the top x percent of the database.
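A minimal sketch of the enrichment factor computation follows, assuming the search result is available as a list of ligand labels ordered from best to worst rank. The function name, the rounding rule used to obtain the number of top-ranked entries, and the toy data are assumptions made for illustration.

```python
def enrichment_factor(ranked_labels, target, top_percent):
    """Compute EF for ligand type `target` at the top `top_percent` percent.

    ranked_labels lists the ligand type of every database pocket, ordered
    from the most to the least similar to the query.
    """
    t_db = len(ranked_labels)                            # database size
    t_p = sum(1 for lab in ranked_labels if lab == target)
    if t_db == 0 or t_p == 0:
        raise ValueError("database must contain at least one target pocket")
    # Number of entries in the top x percent (at least one entry).
    n_x = max(1, round(t_db * top_percent / 100.0))
    n_x_p = sum(1 for lab in ranked_labels[:n_x] if lab == target)
    return (n_x_p / n_x) / (t_p / t_db)


if __name__ == "__main__":
    # Toy ranking: 1000 pockets, 10 bind ATP, 5 of them in the top 10.
    ranking = ["ATP"] * 5 + ["other"] * 5 + ["ATP"] * 5 + ["other"] * 985
    print(enrichment_factor(ranking, "ATP", 1.0))
```

In the toy ranking, half of the top 1 percent are ATP pockets while ATP pockets make up 1 percent of the database, so the EF is about 50. An EF of 1 corresponds to random retrieval.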

Pocket retrieval results
Patch-Surfer was run with six different settings: using either all four properties or only the shape information, combined with three choices for the distance threshold used when matching patches (Eqn. 4): 0.2, 0.3, and no threshold. With a threshold of 0.2, only surface patches with a distance smaller than 0.2 are matched, while the no-threshold option matches the maximum number of pairs between two pockets regardless of their distance (i.e. all the patches in the smaller pocket are matched to patches in the larger pocket). The results (Figure 1A) show, first, that using all four properties gave a better EF than using the shape information alone and, second, that the threshold value of 0.2 performed best among the three choices tested. The best retrieval was thus observed when all the patch properties and the threshold of 0.2 were used. Figure 1B shows the EF for each ligand type using Patch-Surfer with the threshold distance of 0.2 and all four properties. HEM and FAD showed very high EF values of over 30 at early ranks. Patch-Surfer performed relatively poorly for GLC. The reason is that the database contains twenty other ligands that are similar to GLC, with a Tanimoto coefficient higher than 0.85.

Effect of sequence identity on the enrichment factor
In Figure 2 we show the EF (at 1.0%) of each of the 75 query proteins relative to the sequence identity between the query protein and the proteins in the database that bind the same ligand molecule. The correlation coefficient between the average sequence identity and the EF values is 0.05. The plot shows no apparent dependency of the EF values on the sequence identity. Patch-Surfer can thus retrieve binding pockets of the same ligand type even when no highly similar proteins are present in the database.
Figure 3 shows three examples of query and database protein pairs in which the database protein was ranked first in the retrieval by Patch-Surfer. The three pairs bind AMP, FAD, and NAD, respectively. The sequence identity between the AMP binding proteins (Figure 3A) is only 15.8%, that between the FAD binding proteins (Figure 3B) is 16.8%, and that between the NAD binding proteins (Figure 3C) is 15.3%. The proteins in each pair have different overall backbone structures; thus, methods that compare the global protein structure or the global pocket shape would not capture their similarity.

Conclusions
We constructed a large database of representative ligand binding pockets for Patch-Surfer. The high EF achieved by Patch-Surfer shows that the method is able to retrieve pockets of the same binding ligand from the large database even in the absence of homologous proteins in the database. We are currently building a web server for easy access to Patch-Surfer.