Identification and analysis of conserved pockets on protein surfaces
© Cubellis et al.; licensee BioMed Central Ltd. 2013
Published: 22 April 2013
Skip to main content
© Cubellis et al.; licensee BioMed Central Ltd. 2013
Published: 22 April 2013
The interaction between proteins and ligands occurs at pockets that are often lined by conserved amino acids. These pockets can represent the targets for low molecular weight drugs. In order to make the research for new medicines as productive as possible, it is necessary to exploit "in silico" techniques, high throughput and fragment-based screenings that require the identification of druggable pockets on the surface of proteins, which may or may not correspond to active sites.
We developed a tool to evaluate the conservation of each pocket detected on the protein surface by CastP. This tool was named DrosteP because it recursively searches for optimal input sequences to be used to calculate conservation. DrosteP uses a descriptor of statistical significance, Poisson p-value, as a target to optimize the choice of input sequences. To benchmark DrosteP we used monomeric or homodimer human proteins with known 3D-structure whose active site had been annotated in UniProt. DrosteP is able to detect the active site with high accuracy because in 81% of the cases it coincides with the most conserved pocket. Comparing DrosteP with analogous programs is difficult because the outputs are different. Nonetheless we could assess the efficacy of the recursive algorithm in the identification of active site pockets by calculating conservation with the same input sequences used by other programs.
We analyzed the amino-acid composition of conserved pockets identified by DrosteP and we found that it differs significantly from the amino-acid composition of non conserved pockets.
Several methods for predicting ligand binding sites on protein surfaces, that combine 3D-structure and evolutionary sequence conservation, have been proposed. Any method relying on conservation mainly depends on the choice of the input sequences. DrosteP chooses how deeply distant homologs must be collected to evaluate conservation and thus optimizes the identification of active site pockets. Moreover it recognizes conserved pockets other than those coinciding with the sites annotated in UniProt that might represent useful druggable sites. The distinctive amino-acid composition of conserved pockets provides useful hints on the fundamental principles underlying protein-ligand interaction.
Proteins are large molecules characterized by complex structures whose main function is to keep relatively small active sites in good shape. Indeed a precise 3-D architecture is necessary to grip ligands efficiently. A protein surface represents an irregular landscape, rich in pockets and clefts. There can be various binding sites in a single protein and ligands can be as small as ions or large polymers and can function as substrates, inhibitors or allosteric modulators.
Several methods predict ligand binding sites or functional important residues on protein surfaces. Some, such as CastP [1, 2] or SURFNET , exploit geometric properties. Others, such as PDBinder , are knowledge-based. Others again, such as ConSurf , ConCavity , LIGSITE , Crescendo , combine 3D-structure and evolutionary sequence conservation.
To explore the pockets on protein surfaces we developed DrosteP, a program which assesses the conservation of the residues lining the pockets on a protein surface. DrosteP requires that the pockets on the surface of a protein with known 3D-structure are identified with the program CastP [1, 2] and that conserved amino acids are identified aligning multiple homologous sequences.
Any method relying on conservation mainly depends on the choice of the input sequences and the originality of our approach derives from the fact that we specifically address this point. DrosteP uses a descriptor of statistical significance, Poisson p-value, as a target to optimize the choice of input sequences. We demonstrated that deciding how far one should go to collect homologs to calculate residue conservation influences the precision of active site pocket identification.
Given a protein structure, the output of CastP analysis and a set of homolgous sequences, DrosteP supplies the most conserved pocket as well as with the identity of the other conserved pockets. For each pocket the amino acid composition is provided in order to facilitate the identification of the active site and of other druggable pockets
Given the sequence of a protein with known 3D-structure, we gather N homologous sequences and we put them in order by increasing e-value. We build different alignments in FASTA format including sequences with higher e-value recursively. For each alignment j (1 < j < N-1) we identify conserved amino acids and we calculate the total number of atoms belonging to conserved amino acids, TCAj. Under standard conditions the program uses 50 sequences form Uniprot/Swiss-Prot, but any alignment in FASTA format of length N can be chosen. In order to avoid any arbitrary choice of a scoring system, DrosteP identifies amino acids that are 100% identical, but the percentage is lowered if no amino acid completely conserved are found in the multiple alignment. We ran CastP [1, 2] on the protein structure and get the atoms and amino acids lining each pocket.
We calculate the observed pocket conservation OPCij i.e. the observed number of atoms lining the pocket i and belonging to amino acids conserved in alignment j. We estimate the expected pocket conservation EPCij i.e. total number of atoms belonging to conserved amino acids in the alignment j (TCAj) multiplied by the number of atoms lining pocket i (PAi) divided by the total number of atoms of the protein(TA) (EPCij = TCAj × PAi/TA).
We choose the alignment j which provides the OPCij with the lowest p-value (Poisson probability p = e-EPCij EPCijOPCij/OPCij !).
Once the best alignment is chosen, we identify the pockets enriched in conserved amino acids (OPCij/EPCij > 1) at a level of statistical significance, Poisson p-value, lower than 0.05. From now on these will be defined as conserved pockets. The pocket enriched in conserved amino acids and associated with lowest p-value is predicted to be the active site pocket.
To validate the method for identification of active sites based on pocket conservation, we used two groups of proteins with known 3D-structure whose active site had been annotated in Uniprot/Swiss-Prot.
A first set comprises human monomeric proteins. A second set comprises human homodimers, but we restricted the analysis to cases where two chains are found in the crystallographic asymmetric unit. Using these criteria we collected 460 monomers and 165 homodimers.
To test the effect of redundancy, we run skipredundant  to obtain the cluster of sequences that have less than 70% identity pair wise.
Pocket size was measured running CastP. To calculate amino acid composition, we summed all the atoms of a given amino acid lining the pockets and divided the result by the total number of the atoms in the pockets. This calculation was carried out considering all pockets or considering separately conserved (OPCij/EPCij > 1 and p < 0.05) or non conserved pockets (OPCij/EPCij < 1 or p > 0.05).
Prosite patterns were downloaded from ftp://ftp.expasy.org/databases/prosite/. Patterns containing only conserved Cys were excluded, for the remaining patterns, only amino acids completely conserved were considered to calculate amino acid composition.
One-Way Analysis of Variance for Independent Samples was performed to evaluate statistical significance of differences observed when comparing DrosteP and Consurf and when analyzing amino acid or atom type abundance in conserved pockets.
Most programs identify functionally important residues in protein structures. We addressed a slightly different problem, that is, we tried to identify the active site pocket in a protein.
It can be expected that active site pockets contain some functionally active residues, but they also contain residues that are not directly involved in ligand binding. For some studies it is preferable to identify active site pockets. For instance it was demonstrated that mutations occurring in the active site pocket, but not affecting residues directly binding the ligand (as seen in the x-ray structure of protein-ligand complex) can be pathological and prevent the use active site directed drugs . Moreover the definition of an active site pocket is required to bind small ligands to proteins by in silico docking.
Our method, DrosteP, complements a program which finds pockets on protein surfaces. For the preliminary task of pocket identification we chose CastP [1, 2], but in principle other programs might be employed. Once pockets on the surface of a protein with known 3D-structure have been identified, homologous sequences are collected and aligned in FASTA format. Combining results obtained by CastP and the alignment, DrosteP assesses the conservation of the residues lining the pockets.
The number of conserved residues depends on the depth reached in gathering homologs. For this reason DrosteP starts with the first two sequences of the alignment and recursively adds aligned sequences.
For each number j of homologous sequences included in the alignment, DrosteP calculates observed (OPCij) and expected (EPCij) conservation for each pocket i detected by CastP and chooses the alignment j which provides the lowest Poisson p-value. Once chosen the optimal alignment, DrosteP allows to distinguish between non conserved and conserved pockets and to select the most conserved pocket in the latter group, which is predicted to be the active site pocket.
DrosteP precision was calculated counting the numbers of the most conserved pockets that include (TP) or do not include (FP) the active site residues annotated in Uniprot/Swiss-Prot. The tests were carried out running CastP on monomeric or homodimeric human proteins and aligning the sequence of each one to homologous proteins from other species.
We obtained higher precision on homodimers with the method based on sequence conservation (78% versus 67%) and comparable precisions on monomers (84% versus 80%). The size of conserved pockets is variable but, in general they are larger than the non conserved pockets (Figure 1B). We can detect active sites of monomeric proteins with higher precision if we consider both conservation and size. In fact the active site of monomeric proteins coincides with the largest among the conserved pockets in 90% of the monomeric cases. On the other hand in the case of homodimers, precision drops to 76% when the largest pocket among the conserved ones is considered (green bars in Figure 1A).
Recent methods that predict active site pockets or functional residues exploiting evolutionary conservation, ConSurf  and LIGSITE , rely on Consurf-DB , a database where sequence homologs of each of the PDB entries were collected and aligned using standard methods.
The novelty of our method consists in the optimization of the list of homologous sequences to be used to calculate conservation. To test this point specifically, we detected the most conserved pocket either considering the complete list of homologous sequences stored in Consurf-DB (violet bars in Figure 1A) or the optimal list obtained by truncation with DrosteP (red bars in Figure 1A). We found that DrosteP raises precision from 49% to 84%, for monomer and from 57% to 78% for dimers. For a comparison, we also identified the active site pocket as the one which contains the most conserved residue predicted by Jensen Shannon Divergence (JSD) score  and the results are shown in Figure 1A by orange bars.
To test the effect of redundant sequences in the alignment, we created for each protein a set from Consurf-DB where no pair of homologous sequences had more than 70% identical residues. The overall precision of DrosteP with and without redundant sequences was 81% and 76% respectively.
Comparison between DrosteP and Consurf predictions
96.07 ± 0,47
79.31 ± 0.40
6.21 ± 1,99
1.95 ± 0.60
54.15 ± 11,17
90.85 ± 5.06
True negative rate
96.28 ± 0.59
79.26 ± 0.41
Negative predictive value
99.77 ± 0.13
99.94 ± 0.05
Generally speaking, direct identification of active sites and functional residues relies on experiments of mutagenesis or chemical modification and provides only a limited number of the residues involved in substrate binding. In some fortunate cases a more detailed view of the active site is possible because the protein structure has been solved in the presence of a substrate analogue. But even in these cases the view can be incomplete. In fact natural substrates are often molecules larger than those co-crystallized with the enzymes. The knowledge of the wide-ranging active site, that is the surface which is in touch with the natural substrate or ligand, is useful in many cases. For instance it is needed to evaluate the effect of protein modifications in human diseases because mutations occurring in the active site pocket, even if not directly involved in catalysis, are almost inevitably harmful and unsuitable for therapy with chaperones . Pharmacological chaperones which represent a novel and promising approach for the cure of many diseases, are typically reversible inhibitors used at low concentration to stabilize the pathological forms of a protein  and need a functional active site to bind.
DrosteP is very useful to extend the active site and cover all the surface which is in contact with the natural substrate or ligand. We propose to start from the most conserved pocket and enlarge it by addition of the contiguous conserved pockets. At the present this procedure is manual but, in principle, it could be made automatic.
In order to discuss this effect we added in Figure 3B the amino acid composition calculated on conserved residues found in the Prosite patterns . A pattern describes a short, contiguous stretch of protein which is conserved in a protein family either for functional or for structural reasons.
Gly is a small amino acid whose abundance in proteins from vertebrates is similar to that of Ala and Ser. Nonetheless it is overrepresented with respect to Ala and Ser among the amino acids conserved in Prosite patterns. This finding suggests that peculiarities other than the small size bestow a special role to Gly. Limiting the analysis to the protein surface, we observe that Gly is more abundant in conserved than in non conserved pockets. The absence of a side chain, the possibility to adopt different dihedral angles as well as its requirement in beta turns, might be the cause of its overrepresentation in conserved pockets.
Charged amino acids are relatively abundant among residues conserved in Prosite patterns. It is common knowledge that His plays a pivotal role in many active sites and it is not surprising that it is preferentially seen in conserved pockets. Less expected is the finding that the only other charged amino acid preferentially found in conserved pockets at a statistically significant level is Asp.
Main chain amide nitrogen is by far the atom type with the highest preference for conserved pockets. Generally speaking we observed that hydrogen bond donors are more abundant in conserved pockets than hydrogen bond acceptors. The relative abundance of NE1 from Trp and OH from Tyr which can contribute with hydrogen bonding, suggests that aromatic interaction provides selectivity as well as stability to protein ligand interactions.
Our results rely on inference from in silico analysis of structural databases and provide an understanding of the role of specific amino acids and atom types in molecular recognition. Other studies addressed the same problem from different points of view.
Koide and Sidhu summarized their findings with a brilliant title "The Importance of Being Tyrosine: Lessons in Molecular Recognition from Minimalist Synthetic Binding Proteins". They exploited synthetic antibody libraries and demonstrated that "antigen-binding sites that are rich in Tyr and Ser are highly specific and functional, but the addition of Gly can improve function." They also observed that "surfaces containing Trp limited to CDR-H3 are also fairly specific" .
Vajda and coworkers  exploited another approach, computational solvent mapping, to identify hot spots for protein ligand interaction . They chose ten important pharmaceutical targets and they found almost invariably aromatic residues, preferentially Tyr, among the residues important for ligand binding.
Fused ring systems are frequently encountered in molecules in use as approved or experimental pharmaceutical drugs. The layout which requires a face to face interaction with an aromatic amino acid and face to edge p-X interaction, where X is an heteroatom, might provide a preferential drug binding mode.
Each protein in a cell is committed in multiple interactions and its surface is richly carved to fulfill this requirement. In this contest it is reductive to refer to one specific pocket as the active site. The active site can be a hive shaped concave opening. The primary site harboring catalytic residues can be surrounded by other pockets forming a composite surface which accommodates the substrate. Separate pockets can be exploited to bind allosteric ligands. Active sites and allosteric sites are meant to bind natural ligands. In addition to these, other druggable pockets could exist which are not occupied under physiological conditions and could be exploited to bind drugs. In order to find hints on the fundamental principles underlying protein-ligand interaction we need to analyze large number of cases and derive statistically significant conclusions. Identification of active sites is difficult and requires the intervention of an expert human eye. On the other hand it is feasible to define conserved pockets precisely and, utilizing such a definition, it is possible to search a database of proteins with known 3D-structure and gather data.
We studied 3,635 non conserved and 460 conserved pockets and a total of 57,815 atoms. Since most conserved pockets are exploited to bind natural ligands, the residues or the atom types preferentially exposed in conserved pockets should be the most suitable for molecular recognition. Three amino acids Tyr, Trp and Gly as well as the main chain amide nitrogen are particularly abundant in conserved sites. This finding is useful when looking for druggable sites on a protein and when scoring poses obtained with in silico ligand docking.
The publication costs for this article were funded by Telethon - Italy (Grant no. GGP12108) to MVC and GA
The financial support of Telethon - Italy (Grant no. GGP12108) and of MIUR (PRIN 2009 2009MBHZPR_002) is gratefully acknowledged". We thank G. Riccio for proofreading the manuscript
This work is dedicated to our friend and colleague Dr Maria Malanga
This article has been published as part of BMC Bioinformatics Volume 14 Supplement 7, 2013: Italian Society of Bioinformatics (BITS): Annual Meeting 2012. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/14/S7
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.