BSSF: a fingerprint based ultrafast binding site similarity search and function analysis server
© Xiong et al. 2010
Received: 16 July 2009
Accepted: 25 January 2010
Published: 25 January 2010
Genome sequencing and post-genomics projects such as structural genomics are extending the frontier of the study of sequence-structure-function relationship of genes and their products. Although many sequence/structure-based methods have been devised with the aim of deciphering this delicate relationship, there still remain large gaps in this fundamental problem, which continuously drives researchers to develop novel methods to extract relevant information from sequences and structures and to infer the functions of newly identified genes by genomics technology.
Here we present an ultrafast method, named BSSF (Binding Site Similarity & Function), which enables researchers to conduct similarity searches in a comprehensive three-dimensional binding site database extracted from PDB structures. This method utilizes a fingerprint representation of the binding site and a validated statistical Z-score scheme to judge the similarity between the query and database items, even if their similarity is confined to a sub-pocket. This fingerprint-based similarity measurement was also validated on a known binding site dataset by comparison with geometric hashing, a standard 3D similarity method. The comparison clearly demonstrated the utility of this ultrafast method. After the database search, the hit list is further analyzed to provide basic statistical information about the occurrences of Gene Ontology terms and Enzyme Commission numbers, which may help researchers design further experiments to study the query proteins.
This ultrafast web-based system will not only help researchers interested in drug design and structural genomics to identify similar binding sites, but also assist them by providing further analysis of the hit list from the database search.
Over the past decade, a significant proportion of the protein structures deposited in the Protein Data Bank (PDB) have come from the advanced high-throughput methods of various structural genomics initiatives[1, 2]. Although the themes of these funded post-genomic projects differ at many levels, they share a central problem: how these sequences and structures relate to their functions. By increasing the structural repertoire, it is hoped that we will be able to improve our understanding of fold space and how proteins evolve new functions[3, 4]. This structure-function relationship is especially important in the drug design field, where researchers try to target specific proteins involved in disease mechanisms by using structure-based drug design methods [5–7].
The number of in silico methods that can infer protein functions has grown enormously in recent years. They can be categorized as sequence-based and structure-based. Powerful BLAST-like sequence-search methods are able to transfer the function of a well-defined protein family to a protein with high sequence similarity. For cases of lower sequence similarity, more subtle methods such as "profiles" or "hidden Markov models" can be constructed from multiple sequence alignments and applied to find obscure patterns in protein sequences, thus assigning a function to them [10–12]. All of the algorithms above assume that similar sequences are derived via divergent evolution. However, this is limiting, as demonstrated by numerous studies showing that evolution in structure space is much more conserved than in sequence space. This has spurred researchers to develop methods to infer functions directly from structural information. Many structure fold/domain classification databases have been compiled in efforts to build a basis for such algorithms [14–16]. Although they have assisted researchers in assigning functions to proteins with similar folds, these fold-based methods also have their drawbacks, as it has been shown that sequence and structure can also change through convergent evolution. This is exemplified by many proteins involved in metabolic pathways which, although they do not share any fold similarity, all process similar metabolites. One explanation could be that they have a similar spatial arrangement of key residues in their catalytic binding sites. This limitation further encourages researchers to develop methods based on key residues or local structural motifs.
Searching for similar local spatial patterns in structure databases is an especially challenging task, since it involves a large search space and is usually time-consuming. Given its importance both in basic biology research and in drug design, several algorithms have been devised to tackle this difficult problem of finding similar functional sites in structures [18–24]. Among these local spatial similarity detection methods, some rely on curated local structure patterns, as in the TESS and SPASM systems, and identify similar local structures by comparing these curated structure templates with the query. Others are more flexible: they take many structures into account, build the structure patterns during the process and detect similar local structures on the fly, as exemplified by Hamelryck's multidimensional index tree method. These methods not only give hints about structure evolution and protein functions, but also play a significant role in predicting drug side-effects caused by ligand cross binding to similar surface patches on various target structures. At the fundamental level, the elementary search algorithms used in these methods are usually based on geometric hashing and graph clique detection. Due to the significant computational expense of these algorithms, they typically rely on previously calculated data sets, and it would be nontrivial to apply them across the whole structure space represented by the PDB. Users can only visualize and analyze results stored in a pre-compiled database. Meanwhile, as a consequence of structural genomics research, there is a continuously increasing demand for fast, large-scale binding site similarity searches with user-supplied structures, in order to find possible functions for these binding sites as well as for the proteins.
Here we present a new ultrafast binding site similarity search method along with a simple analysis of the occurrences of Gene Ontology (GO) terms and Enzyme Commission (EC) numbers in the hit list. Our method was inspired by the fingerprint and pharmacophore concepts commonly used in chemoinformatics. We first systematically extracted possible binding sites from protein structures, mapped pharmacophore properties to the binding sites and calculated fingerprints for later database searches. To enable users to identify subtle similarity between binding sites, a panel of fingerprint measurements was tested to find the optimal solution. Further, a statistics-based scoring method was developed to evaluate hits with the aim of eliminating false positives. This novel binding site similarity search method should enable researchers in the structural biology field to examine possible binding sites in detail, rapidly and on a large scale. It will also benefit researchers in the field of drug development by allowing them to predict and investigate possible side effects due to ligand cross binding events.
First, a total of 41449 protein structures were gathered from the RCSB PDB (up to 2008.1). Then, a geometry-based protein binding site detection method, named PASS (Putative Active Sites with Spheres), was adopted to extract possible binding sites from every polypeptide chain in the database of protein structures. This tool is able to characterize concave regions of a protein surface and to identify positions likely to represent binding sites based upon the size, shape, and burial extent of these volumes. The output spherical probes of PASS were processed with an in-house program based on a minimum spanning tree algorithm to split the probes around the protein surface into clusters at least 5 Å apart. This resulted in 201,233 probe clusters, which were further filtered so that only probe clusters containing 30 to 200 probes were considered as binding sites. For each of these feasible binding sites, the protein residues within 6 Å of its probes were extracted. The binding site probes, along with the identified binding site residues, were stored in the database for later visualization and fingerprint calculation. Two other datasets, compiled from known small-molecule binding sites, were also constructed for a better evaluation of our method. First, the HETATM residues in the PDB database were filtered so that only those with molecular weights between 100 and 800 were retained. In addition, any residue that appears more than 10 times in the PDB database was regarded as a common molecule and discarded. Finally, 13227 PDB structures were used as a database for later geometric hashing and fingerprint calculation. In the geometric hashing calculation, only the backbone CA atoms and the centroids of the functional fragments of the side chains of each binding site were retained (see additional file 1, Table S1 for definitions of fragment types). We named this dataset the GH Validation Dataset.
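The probe post-processing described above can be sketched in a few lines. This is only a minimal illustration, not the in-house program: it assumes that probes are plain (x, y, z) tuples, and it replaces the explicit minimum spanning tree with single-linkage union-find, which yields the same clusters as cutting every spanning-tree edge of 5 Å or longer. The function names and the `atoms` input format are hypothetical.

```python
import math
from collections import defaultdict


def cluster_probes(probes, gap=5.0):
    """Group probes so that different clusters are at least `gap` apart.

    Single-linkage clustering; equivalent to cutting the minimum
    spanning tree at edges of length >= gap.
    """
    parent = list(range(len(probes)))

    def find(i):
        # Follow parent links to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(probes)):
        for j in range(i + 1, len(probes)):
            if math.dist(probes[i], probes[j]) < gap:
                parent[find(i)] = find(j)

    clusters = defaultdict(list)
    for i in range(len(probes)):
        clusters[find(i)].append(i)
    return list(clusters.values())


def keep_binding_sites(clusters, min_probes=30, max_probes=200):
    """Keep only clusters whose probe count is in the accepted range."""
    return [c for c in clusters if min_probes <= len(c) <= max_probes]


def residues_near(probes, cluster, atoms, cutoff=6.0):
    """Return IDs of residues with any atom within `cutoff` of a probe.

    `atoms` is an iterable of (residue_id, (x, y, z)) pairs (assumed format).
    """
    near = set()
    for res_id, xyz in atoms:
        if any(math.dist(xyz, probes[i]) <= cutoff for i in cluster):
            near.add(res_id)
    return near
```

The pairwise loop is quadratic in the number of probes, which is acceptable for a single chain; a spatial grid would be the natural replacement at larger scale.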
The other dataset is the FP Validation Dataset, which uses the known binding sites of these 13227 PDB structures to calculate fingerprints with the method described below.
which denotes the dissimilarity score between fingerprint i and j, both of which have n bins.
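The equation this sentence refers to is not reproduced in the present text. Since the Canberra Distance is later named as the best-performing scoring method, a dissimilarity of that kind can be illustrated with the textbook definition, d(i, j) = Σ_k |x_ik − x_jk| / (|x_ik| + |x_jk|), summed over the n bins. The sketch below assumes this standard form and assumes that bins empty in both fingerprints contribute zero; it is not necessarily identical to the authors' equation.

```python
def canberra_distance(fp_i, fp_j):
    """Textbook Canberra distance between two equal-length fingerprints.

    Assumption: a bin that is zero in both fingerprints contributes 0
    (the 0/0 term is skipped).
    """
    if len(fp_i) != len(fp_j):
        raise ValueError("fingerprints must have the same number of bins")
    total = 0.0
    for a, b in zip(fp_i, fp_j):
        denom = abs(a) + abs(b)
        if denom:
            total += abs(a - b) / denom
    return total
```

Identical fingerprints score 0, and each bin adds at most 1, so the score is bounded by the number of bins occupied in at least one fingerprint.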
Thus, a comparable Z-score is available for selecting similar binding sites from the fingerprint database.
The goal of structural biology is to investigate and understand protein functions through three-dimensional structures. Proteins execute their functions by binding to other cellular components such as ligands. The contact points located on the protein surface are commonly known as binding sites. Although it remains a challenge to identify binding sites solely from a three-dimensional structure, several computational methods have been developed to detect such spatial motifs. One of them, PASS, is a binding site detection program based on an analysis of the geometric features of the protein surface. Because biologically meaningful binding sites are so difficult to identify, we decided to gather all possible binding sites for the analysis. First, a total of 41449 structures were retrieved from the PDB database and filtered to remove all non-standard amino acids. Next, each chain containing more than 100 amino acids was saved as a file for binding site detection with the PASS program, which resulted in 201233 possible binding sites, or roughly two binding sites per polypeptide chain. A detailed analysis of these possible binding sites shows that the average size of a binding site is 30 amino acids (summarized in additional file 1, Table S3). Mapping the binding site residues to pharmacophore fragments identifies about 88 fragments per binding site. Analysis of the pharmacophore type distribution clearly shows that both hydrogen bond donor/acceptor and aromatic fragments are enriched in the binding sites compared to the whole proteins, whereas lipophilic pharmacophores are not.
Due to the difficulty of obtaining a gold standard benchmark of similar binding sites on a meaningful scale, we used the simulated datasets described below as a control to identify the most appropriate measurement for fingerprints sharing only modest similarity. We first grouped the binding sites in the database according to their fragment numbers, denoted as Group n (n is the fragment number in the binding site). For each Group 5×i (i = 6, 7, 8, ..., 40), we selected 10 binding sites randomly. Two synthesized binding site datasets were created with the following strategies.
For each binding site S (with fragment number Num s ), we created 100 new structures, indexed by k (k = 1, 2, 3, ..., 100). In each new structure, the original structure of S was kept and random points were added with the following procedure:
Fragment types are selected randomly from 1 to 7;
The two farthest points in the binding site are located, and two spheres of 8 Å radius are created, centred on these two points;
Point coordinates are set randomly inside the spheres until the required number of points has been obtained; every point should be separated from the others by at least 3.5 Å.
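The three steps above can be sketched as a rejection-sampling routine. This is a hedged illustration rather than the authors' program: it assumes that a binding site is a list of (fragment_type, (x, y, z)) pairs, that the 3.5 Å separation applies to both the original and the added points, and that each new point is drawn from one of the two spheres with equal probability. The function name and the `max_tries` safeguard are hypothetical.

```python
import itertools
import math
import random


def add_random_points(site, n_new, radius=8.0, min_sep=3.5,
                      rng=None, max_tries=100000):
    """Return a copy of `site` with `n_new` random fragments appended."""
    rng = rng or random.Random()
    coords = [xyz for _, xyz in site]
    # Step 2: the two farthest points become the sphere centres.
    centre_a, centre_b = max(itertools.combinations(coords, 2),
                             key=lambda pair: math.dist(*pair))
    points = list(site)
    tries = 0
    while len(points) - len(site) < n_new:
        tries += 1
        if tries > max_tries:
            raise RuntimeError("could not place all points; spheres too crowded")
        centre = rng.choice((centre_a, centre_b))
        # Step 3: sample in the bounding cube, reject if outside the sphere.
        offset = [rng.uniform(-radius, radius) for _ in range(3)]
        if math.hypot(*offset) > radius:
            continue
        candidate = tuple(c + o for c, o in zip(centre, offset))
        if all(math.dist(candidate, xyz) >= min_sep for _, xyz in points):
            # Step 1: the fragment type is drawn uniformly from 1 to 7.
            points.append((rng.randint(1, 7), candidate))
    return points
```

Passing a seeded `random.Random` instance makes a generated structure reproducible, which is convenient when the same simulated set has to be rescored with several fingerprint measurements.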
This dataset is derived from Dataset I. The atom types in each structure are re-assigned randomly while the coordinates of the binding site structures are retained. We consider this dataset as a random dataset.
Although the Canberra Distance scoring method outperforms other fingerprint scoring methods in this simulation, it was clearly not feasible to separate similar binding sites from random ones if the binding sites were in different groups. This spurred us to devise a Z-score function to correct the similarity score calculated for different binding sites.
The Z-score can then be computed with Equation 3.
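Equation 3 is not reproduced in the present text. As a generic illustration only, a standard score expresses a raw dissimilarity relative to a background distribution, z = (d − μ) / σ, where μ and σ are the mean and standard deviation of scores obtained against random binding sites of comparable size. The sketch below assumes that plain definition; how the authors parameterize μ and σ is not stated here, and the function name is hypothetical.

```python
import statistics


def z_score(raw_score, background_scores):
    """Standard score of `raw_score` against a background sample.

    Because the raw score is a dissimilarity, more negative Z-scores
    indicate binding sites that are more similar than expected by chance.
    """
    mu = statistics.mean(background_scores)
    sigma = statistics.stdev(background_scores)
    if sigma == 0:
        raise ValueError("background scores have zero variance")
    return (raw_score - mu) / sigma
```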
To evaluate the performance of the Z-score, it was recalculated for Dataset I and Dataset II. It was found that the Z-score for the similarity of the random binding sites was around 2.5, except in the case of very small binding sites. This should allow users to better judge the results from a database search.
The TPR represents the sensitivity, while the FPR is 1 − specificity.
As demonstrated in Figure 5, it is not surprising that the classification ability is limited in some cases. Although at lower Z-score cut-offs (from -5 to -2) the TPR is dominant, the similarity detection ability falls off as the Z-score cut-off rises. To take account of the effects of added points on Z-score performance, we also randomly chose 1000 binding sites from Dataset I and moved binding sites with an added point percentage larger than 80% to the false group. Combined with another 1000 randomly chosen binding sites from Dataset II, the ROC curve was calculated again and plotted as the 80% line in Figure 5. Similarly, we calculated the 60%, 50%, 40% and 30% lines and plotted them in Figure 5. As expected, the power of the classifier increases as more of the subtly similar binding sites are assigned to the false group. The ROC curves show that even when 50% random fragments were added to the binding sites, the Z-score scheme was still able to give encouraging results. With a Z-score cut-off of -1.5, the true positive rate is about 50% and the false positive rate is only 10%.
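The way a single ROC point follows from a Z-score cut-off can be made concrete with a short sketch. It assumes that a pair is called similar when its Z-score is at or below the cut-off (lower meaning more similar), and that the two input lists hold the Z-scores of the truly similar and the random pairs respectively; the function names and sample values are illustrative only.

```python
def roc_point(z_similar, z_random, cutoff):
    """Return (TPR, FPR) when pairs with Z <= cutoff are called similar."""
    true_pos = sum(z <= cutoff for z in z_similar)
    false_pos = sum(z <= cutoff for z in z_random)
    return true_pos / len(z_similar), false_pos / len(z_random)


def roc_curve(z_similar, z_random, cutoffs):
    """Return a list of (cutoff, TPR, FPR) tuples, one per cut-off."""
    return [(c, *roc_point(z_similar, z_random, c)) for c in cutoffs]


# Made-up Z-scores, purely to show the call pattern.
tpr, fpr = roc_point([-3.0, -2.0, -1.0, 0.0], [-2.0, 0.0, 1.0, 2.0, 3.0], -1.5)
```

Moving heavily perturbed sites from the true group to the false group, as done for the 80% to 30% lines, only changes which list a Z-score is placed in before the same computation is repeated.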
To validate our fingerprint scoring strategy, a geometric hashing based similarity measurement was implemented for comparison. Geometric hashing is a well-known sequence-independent 3D similarity search method, which has been adopted as a basis by web servers such as SitesBase. A total of 450 PDB entries were randomly selected from the known binding site dataset. They were then searched with both geometric hashing and our fingerprint scoring method, against the GH and FP Validation Datasets respectively. In total, 431 PDB binding sites were successfully processed by both methods. (All the validation results and software can be accessed at our web site http://220.127.116.11:8080/bssf/validate_exp.jsp and http://18.104.22.168:8080/JChem/li/indexGH.jsp).
The output from the geometric hashing method was categorized into three levels of similarity according to the percentage of points matched with the query (> 1/3, > 1/2 and > 2/3). In this test, 11121, 4309 and 1568 pairs of similar PDB entries were found at the 1/3, 1/2 and 2/3 levels respectively.
We also defined a procedure to classify the results from the fingerprint-based method. For each query, we first counted the PDB entries found to be similar by the geometric hashing method at a given level; this number was set as Nsim. Then the search result of the fingerprint method was sorted by Z-score from lowest to highest. The sorted list was truncated so that only the first Nsim × Nfold (Nfold = 1, 2, 3, 4, 5) entries were kept. The truncated list was then examined by counting the PDB entries (Nfound) that were also considered similar by the geometric hashing measurement. Finally, the ratio Nfound/Nsim was used as a success ratio to assess the fingerprint Z-score strategy.
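The success ratio defined above is straightforward to compute. In the following sketch the fingerprint search result is assumed to be a list of (pdb_id, z_score) pairs and the geometric hashing result a set of PDB IDs; the function name and the placeholder identifiers are hypothetical.

```python
def success_ratio(gh_similar, fp_hits, n_fold=1):
    """Return Nfound / Nsim for one query, or None if Nsim is zero.

    gh_similar: set of PDB IDs judged similar by geometric hashing.
    fp_hits: list of (pdb_id, z_score) pairs from the fingerprint search.
    """
    n_sim = len(gh_similar)
    if n_sim == 0:
        return None
    # Sort by Z-score from lowest (most similar) to highest, then truncate.
    ranked = sorted(fp_hits, key=lambda hit: hit[1])
    top_ids = {pdb_id for pdb_id, _ in ranked[: n_sim * n_fold]}
    n_found = len(top_ids & set(gh_similar))
    return n_found / n_sim


# Placeholder identifiers, not real PDB entries.
hits = [("A", -5.0), ("C", -4.0), ("D", -3.0), ("B", -2.0), ("E", 0.0)]
ratio_fold1 = success_ratio({"A", "B"}, hits, n_fold=1)
ratio_fold2 = success_ratio({"A", "B"}, hits, n_fold=2)
```

With Nfold = 1 only the two best-scoring hits are inspected, so one of the two geometric hashing neighbours is recovered; doubling the list length recovers both.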
A close look at the results from the geometric hashing and fingerprint methods shows that they can complement each other. For example, according to the PDB annotation, the entry 3H4A is E. coli 6-Hydroxymethyl-7,8-dihydropterin pyrophosphokinase (HPPK) complexed with AMPCPP. In the geometric hashing search, a total of 11 PDB entries were found with similarities better than the 1/3 level (matched points exceeding 1/3 of the query binding site). A BLAST search of the 3H4A sequence against these 11 PDB entries also shows that all of them have BLAST E-values lower than 1E-50. This demonstrates that our geometric hashing procedure is capable of finding similar binding sites. Interestingly, after investigating the fold-two truncated list from the fingerprint method, we found that this list not only covers the 11 entries found above, but also includes some other entries, such as 1CBK and 2BMB (which have low BLAST E-values of 4E-47 and 1E-17 with 3H4A) and 1C85, 1C86 and 1C88 (all of which are protein-tyrosine phosphatase 1B). All of these extra entries have low fingerprint raw scores and Z-scores but do not show up in the geometric hashing results.
Case studies by comparison with other servers/software.
Cases from same SCOP families
Cases from different SCOP families
A supporting website http://22.214.171.124:8080/bssf/ was constructed for user-defined calculations and predictions. The web site was built with Java JSP technology, and the analysis and prediction procedure consists of four steps: 1) The user supplies a newly determined crystal structure or a PDB entry ID to conduct the analysis. Following submission of the data, the binding site detection and fingerprint calculations are initiated. 2) The user can visualize the binding sites in the crystal structure using the web-embedded Jmol program. 3) The user can select a binding site to perform a database search with the Z-score scheme. After this fast database search finishes (normally in about 1 minute), the result appears in a table on the web page for inspection. The user can download the result for later analysis, query NCBI further with the BLAST program, or check the entry in the PDB database. 4) The user can take further steps to analyze the hit list and check the occurrences of basic information such as GO terms and EC numbers. These metrics may help researchers design new experiments to study the query protein at hand.
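The occurrence analysis of the hit list amounts to counting how many hits carry each annotation. The sketch below is an assumption about the data layout rather than a description of the server: each hit is taken to be a dictionary with a list of GO terms and a list of EC numbers, and the PDB identifiers and annotations shown are placeholders chosen for illustration.

```python
from collections import Counter


def annotation_counts(hits, key):
    """Count in how many hits each annotation under `key` occurs.

    Returns (annotation, count) pairs, most frequent first. An annotation
    repeated within one hit is counted once for that hit.
    """
    counts = Counter()
    for hit in hits:
        counts.update(set(hit.get(key, ())))
    return counts.most_common()


# Hypothetical hit list; IDs and annotations are illustrative only.
hit_list = [
    {"pdb": "hit1", "go": ["GO:0005524", "GO:0016301"], "ec": ["2.7.6.3"]},
    {"pdb": "hit2", "go": ["GO:0005524"], "ec": ["2.7.6.3"]},
    {"pdb": "hit3", "go": [], "ec": ["3.1.3.48"]},
]
go_summary = annotation_counts(hit_list, "go")
ec_summary = annotation_counts(hit_list, "ec")
```

An annotation shared by most of the top hits is the kind of signal that could suggest a function for the query binding site.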
A major goal of structural biology is to understand cellular functions in the context of the atomic details of molecules. With increasing deposits of three-dimensional structures in the RCSB Protein Data Bank through structural genomics initiatives, there is a pressing requirement for experimental or computational methods to correlate functions to these structures. This sequence-structure-function relationship underlies the numerous investigations aimed at dissecting the biological properties of proteins.
In the present work, we describe a fast method for detecting similar binding sites in protein structures across the whole PDB database. This may shed light on protein function and on possible drug side-effects due to ligand cross binding to similar sites. Our method was developed for 3D local structure similarity detection and complements sequence-based or fold-based methods. It can uncover similarities in small spatial surface regions on protein structures and provide additional evidence for inferring protein functions. In contrast to many 3D motif similarity search methods, we use a fingerprint approach to represent the binding sites. The fingerprint concept is widely used in the chemoinformatics field for small molecule database searching and has proved to be fast and useful in ligand similarity research. We have extended it to label binding sites in macromolecules. To simplify the large number of atoms in binding sites and to implicitly add flexibility to the fingerprint representation, every residue in each binding site was fragmented into subgroups and mapped to 7 properties similar to the pharmacophore concept. This fingerprint representation not only eliminated the sequence order dependence usually encountered in sequence/structure similarity measurements, but also enabled an ultrafast method of searching a comprehensive binding site database. A single query against our binding site database (188959 entries after filtering out binding sites smaller than 30 points or larger than 200 points) takes approximately 1 minute on an Intel XEON 2.8 GHz processor. This gives researchers the opportunity to conduct large-scale comparison studies to elucidate functions for any possible binding site.
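The exact layout of the fingerprint is not specified in the present text. To illustrate how a representation built from typed fragments can be independent of sequence order, the sketch below uses one common chemoinformatics construction: a histogram of inter-fragment distances, kept separately for each unordered pair of the 7 fragment types. The bin width, the number of distance bins and the function name are all assumptions made for illustration, not the authors' parameters.

```python
import itertools
import math

N_TYPES = 7  # number of pharmacophore-like fragment types


def pair_distance_fingerprint(fragments, bin_width=2.0, n_dist_bins=10):
    """Return a flat count vector over (type pair, distance bin).

    fragments: list of (fragment_type in 1..7, (x, y, z)).
    Distances beyond the last bin are accumulated in the last bin.
    """
    type_pairs = itertools.combinations_with_replacement(
        range(1, N_TYPES + 1), 2)
    pair_index = {pair: k for k, pair in enumerate(type_pairs)}
    fingerprint = [0] * (len(pair_index) * n_dist_bins)
    for (t1, p1), (t2, p2) in itertools.combinations(fragments, 2):
        d_bin = min(int(math.dist(p1, p2) // bin_width), n_dist_bins - 1)
        pair = (min(t1, t2), max(t1, t2))
        fingerprint[pair_index[pair] * n_dist_bins + d_bin] += 1
    return fingerprint
```

Because only fragment types and pairwise distances enter the vector, listing the fragments in any order, or rotating and translating the site, leaves the fingerprint unchanged; this is the property that makes a fixed-length comparison across a large database cheap.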
The method described here differs substantially from structure-template based methods. In those methods, local structure templates are curated from protein structures and usually contain only a few residues, as exemplified by the three-residue catalytic triad "O-HIS-O" in the TESS system. Although the method presented here also needs to extract the binding sites from PDB structures in advance, the binding sites are not limited to a fixed number of residues. Based on the fingerprint concept, binding sites of variable size can be represented and compared without any difficulty. Such comparisons would be very time-consuming with methods based on the graph clique algorithm or the geometric hashing algorithm [22, 23]. Recently, Xie and Bourne devised a method, SOIPPA, based on the weighted graph maximum clique detection algorithm, which can find similar functional sites through sequence order-independent profile-profile alignment. By implementing several heuristic rules, the authors accelerated the functional matching phase and simultaneously found and aligned similar binding sites. However, because the underlying algorithm is based on graph maximum clique detection, the running time is still beyond what is practical for a routine search of the whole PDB database with all possible binding sites, especially given the fast growth in the number of structures coming out of structural genomics projects.
In comparison to the similarly fast method pvSOAR, which also extracts possible binding sites with an automatic alpha shape method, our method is sequence-order independent and does not take into account the local sequence similarity between the two binding sites. This represents a more natural way to describe the shape and properties of a binding site, especially where two binding sites only share sub-pocket similarity. WebFeature is another ultrafast binding site similarity comparison and functional annotation system. It uses the calculated biophysical properties of binding sites to represent the binding site and utilizes a machine learning method to train the system and predict the function for the query. Compared to that approach, our method is based solely on the original PDB structure data and does not go through a training phase. In addition, the WebFeature method is based on already determined functional motifs stored in the PROSITE database, and therefore may not cover all the possible binding sites represented in the PDB structures.
The binding site database in our method could be further improved to expand coverage and accuracy. In the current implementation, only binding sites that involve single polypeptide chains are taken into account. We do so mainly because it is very difficult to separate true multi-chain complexes from artefacts due to crystal packing interactions in a unit cell. Nevertheless, including such binding sites in our database would expand its coverage and enhance function inference. Another drawback of our method is that the binding sites predicted by PASS may not be the true binding sites on the protein surface. This may increase the false positive rate and reduce prediction power. Although computational methods to identify the true binding sites exist, this has proved to be a very difficult task because of our limited knowledge of the protein-ligand interactions that exist in nature.
A major challenge in analyzing local spatial patterns is how to assess the significance of the detected similarity. Because a gold standard binding site data set is difficult to obtain, we decided to use simulated binding sites as the basis for statistical judgement. To overcome the limitations of the original fingerprint Canberra Distance score function, we devised a Z-score scheme and investigated its boundary in detail by gradually changing two variables, namely the Z-score cut-off number and the Fix number. The ROC curve showed that the performance is promising even at a Z-score cut-off value of -1.5 and with less than 50% added random points. This validation strengthens the utility of our method and provides guidelines for later database searches. As demonstrated by the ROC curve, our method has the capability to detect sub-pocket similarity. Detecting such weak similarity is very important in drug design, since a ligand may only interact with a few key residues in a binding site to execute its biological role. This will help researchers to identify possible targets similar to known drug targets and to predict side-effects for certain drugs.
In future, one important extension of our method is to combine it with other sequence-based or structure-based function inference methods to enhance accuracy in assigning functions. Recently Brylinski and Skolnick provided a method named FINDSITE, which can locate the binding sites in a protein structure through a threading alignment of distant homologues. Their method can successfully identify 70.9% of binding sites within the top five predicted binding sites. Although the prediction power dropped when the sequence identities of the homologues were below 35%, combination with fold information or sequence information could improve the prediction accuracy. Similarly, in the comprehensive protein functional annotation database ProKnow, Pal et al. integrated information about the query protein and then weighted the information in a Bayesian framework. Their investigation clearly demonstrated that multiple sources of information enhance the prediction power. Given the speed of our binding site similarity method, it could be combined with other sequence and structure similarity measurements to improve the prediction of binding site functions.
It is well recognized that sequence and structural fold change dynamically under evolutionary pressure over long time scales. These diverging and converging evolutionary phenomena produce a challenging problem: how to infer the functions of newly discovered genes from their sequences and structures. Many sequence-based and structure-based methods have been developed to correlate functions to sequences and structures and to extend our ability to understand the fundamental relationship between sequence, structure and function. Although some cases can be easily solved, more difficult cases often show only very weak sequence and structure similarity to proteins with known functions in curated databases. As a consequence, there is an ongoing need for novel methods to broaden our capability to predict function in this post-genomics era. Here we presented a novel and fast binding site similarity detection and function inference system. By utilizing fingerprint representations of binding sites, we are able to conduct an economical similarity measurement. Furthermore, for the accurate detection of similar binding sites, especially ones where there is only weak or sub-pocket similarity, a statistically validated Z-score scheme was devised to improve sensitivity. This system could be used in the drug design field to identify promising targets for a drug by using the binding site of its known target as a query. It could also benefit researchers in the field of structural biology by allowing them to find similar structures at the binding site level.
We thank Dr. Albert M. Berghuis (McGill University, Canada) for providing critical comments and suggestions. This work was financially supported by National Natural Science Foundation of China (Grant 30600784 to B.X.) and State Key Program of Basic Research of China (Grant 009CB918502 to B.X.).
Funding: This work was financially supported by:
1, National Natural Science Foundation of China Grant 30600784.
2, State Key Program of Basic Research of China Grant 009CB918502.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.