Skip to main content

Combining specificity determining and conserved residues improves functional site prediction



Predicting the location of functionally important sites from protein sequence and/or structure is a long-standing problem in computational biology. Most current approaches make use of sequence conservation, assuming that amino acid residues conserved within a protein family are most likely to be functionally important. Most often these approaches do not consider many residues that act to define specific sub-functions within a family, or they make no distinction between residues important for function and those more relevant for maintaining structure (e.g. in the hydrophobic core). Many protein families bind and/or act on a variety of ligands, meaning that conserved residues often only bind a common ligand sub-structure or perform general catalytic activities.


Here we present a novel method for functional site prediction based on identification of conserved positions, as well as those responsible for determining ligand specificity. We define Specificity-Determining Positions (SDPs), as those occupied by conserved residues within sub-groups of proteins in a family having a common specificity, but differ between groups, and are thus likely to account for specific recognition events. We benchmark the approach on enzyme families of known 3D structure with bound substrates, and find that in nearly all families residues predicted by SDPsite are in contact with the bound substrate, and that the addition of SDPs significantly improves functional site prediction accuracy. We apply SDPsite to various families of proteins containing known three-dimensional structures, but lacking clear functional annotations, and discusse several illustrative examples.


The results suggest a better means to predict functional details for the thousands of protein structures determined prior to a clear understanding of molecular function.


Structural genomics, and the increased pace of structure determination by X-ray and NMR approaches makes methods to predict protein function from 3D structure of continuing importance. Proteins of known structure and unknown function are normally subjected to a battery of comparisons to find proteins adopting similar folds (DALI [1], SSAP [2], CE [3] and others) or containing recurring active-site residue constellations (SPASM [4], PINTS [5], Catalytic Site Atlas [6]). Proteins of similar structure can provide functional hints, since it is very often the case that proteins share structural and functional similarities in the absence of sequence similarity. Active site-only similarities (i.e. in the absence of overall fold similarity) are typically less revealing, but can sometimes suggest the presence of a convergently evolved catalytic machinery (e.g. the peptidase catalytic triad [7]) or binding sites for particular metals or ligands [8].

When comparative approaches fail to identify a clear similarity, or if such similarities are ambiguous – for instance, suggesting a possible weak functional similarity requiring confirmation – additional functional hints can come from analyses of the protein structure, similar structures, and what is typically a large collection of homologous sequences from the set of genomes now available. There are now several methods that exploit sequence conservation to identify putative functional sites in proteins, including ConSurf [9], approaches based on identification of 3D clusters of conserved residues [10], the evolutionary trace (ET) approach [11, 12], correlated mutations [13], prediction of 3D motifs correlated with function [14], Jensen-Shannon entropy approach [15], algorithm based on contrasting global and local similarity matrices that interpret locality in terms of sequence [16] or structure [17]. The approaches differ in design, but share a unifying theme of using conserved amino acids, together with structural constraints such as location on the protein surface, as indicators of likely functional importance.

Some of the presented techniques make a special emphasis on using a protein structure for prediction. A number of methods identify interaction hot spots on different kinds of interaction interfaces [1820]. Other methods concentrate on predicting pockets in protein structures, as they are possible ligand-binding sites [2123], sometimes supplementing them with annotation derived from other available sources [24].

In contrast to those, several approaches, including our own [25], have attempted to exploit protein sequence alignments to determine those residues likely to confer specificity for a particular sub-function in a protein family [2541]. Although they differ in algorithmic details, all approaches aim to use the statistics of a multiple sequence alignment to identify positions that correlate well with sub-families that account for a certain specific function. Sub-families are either explicitly given in advance (for instance taken from gene or protein annotation) or are predicted by the algorithm.

Here, we extend our previously derived approach for identifying specificity determining residues [25] to the problem of predicting protein functional sites (SDPsite). We combine routine predictions of conserved residues with those for specificity determinants, and use structural information to identify spatial clusters of the predicted important residues. SDPsite differs from structure-based methods in that the major part of the prediction is derived from the protein sequences. So, in theory, the method can also be applied in absence of structural information. The structure-based step filters out part of the predicted positions, thus leaving only the most reliable predictions, which can be useful in the design of experimental studies.

To test our method on a large scale, we developed two benchmark datasets of diverse enzyme families, using the Enzyme Classification (EC) system. Enzymes are the simplest class of proteins to benchmark as their functional annotations are well specified in databases. However, they are not necessarily representative of other protein functions that are less discretely characterized by precise catalytic machinery (e.g. protein recognition modules, etc.). In the absence of a reliable source of functional annotations for non-enzymes, we previously tested the presented approach on two examples, for which reasonable experimental data are available, the LacI family of the bacterial transcription factors and subtilisin-like proteases [42]. On these examples, SDPsite results have a sensitivity between 0.06 and 0.17, specificity between 0.43 and 0.75 and false positive rate between 0.007 and 0.05. Thus, in these anecdotal cases, SDPsite seems to miss a lot of truly functional amino acids, but still provides reliable predictions.

We also predicted the functional sites in 124 unannotated structures derived from Structural Genomics efforts. For the benchmark datasets, the success rate of our method (SDPsite) was 96–100%. We were then able to make confident functional site predictions for ~76% of a set of families lacking functional annotation.


1. Testing SDPsite on a benchmark set of enzymes with bound cognate ligands

We previously tested SDPsite on a number of protein families with known sites and compared the performance with several other approaches [42]. The results encouraged us to predict functionally important sites in poorly characterized protein families. Structural genomics projects now provide up to 20% of annual growth of the Protein data bank (PDB) and a greater coverage than ever before of the space of protein structures [43]. This has led to the current situation where hundreds of protein families include a protein with a known 3D structure, but no available functional information. These families are a perfect target for functional site prediction methods that use both sequence and structural information.

However, before applying our approach to families lacking functional information, we needed to benchmark the approach on a set of protein families that are well-characterised in terms of function. To do this we considered enzyme protein families from the Pfam database, for which there is a functional characterization scheme in the Enzyme Classification (EC) number system. This classification consists of four numbers denoting a hierarchical system that delineates enzyme function. We focused on families containing EC numbers differing in the last number, which normally accounts for the substrate specificity. Since protein families generally correspond to a single functional or structural domain, complications can arise for multi-domain proteins that correspond to a single EC number. We inspected the families manually to ensure that the catalytic operation for each EC number did indeed correspond to the domain considered. Thus each considered protein domain corresponds to a single EC number, thus catalyzing only one reaction (or one class of reactions), and presumably have only one active site.

To assess performance of SDPsite on these Pfam enzyme families, we generated two benchmark datasets. The first consisted of families containing proteins with at least two EC numbers differing in the last position; the second consisted of families containing only a single EC number. We refer to these datasets as diverse and homogeneous in the sections that follow. The rationale is that when one is predicting function and/or specificity, one does not known in advance whether or not there are multiple specificities in the family. These two datasets mimic both of these situations.

For all families we computed both specificity determining positions (SDPs) and conserved positions (CPs), then mapped them onto a 3D structure of one of the proteins of the family and extracted a portion of the two sets that forms a compact spatial cluster, as described previously [42] and in the Methods section. We designed several distance measures to assess the quality of the predictions. These were: 1) the minimal distance from the residues of each of the predicted sets (SDPs, CPs, best cluster) to a bound ligand; 2) the average distance to the ligand; 3) the diameter of the set; and 4) the average distance between residues of the set. We performed a Mann-Whitney test to assess the statistical significance of the best derived cluster, i.e. we tested if the set of the amino acid residues in the best cluster is significantly closer to the ligand than all residues in the protein. These data are given in Additional file 1.

We considered predictions to be successful if minimal distances were smaller than 5 Å and average distances smaller than 10 Å. We selected these thresholds based on inspection of known binding sites, and found that they capture characteristics of typical binding sites, which are normally 15–20 Å in diameter and typically some of the amino acids of the cluster contact the ligand directly. A small minimal distance and a large average distance means that the cluster is too sparse and does not define the active site well enough, but still a part of it is close to ligand and might be functional. Generally there is no correlation between either diameter or average distance within a predicted set of residues and the set's proximity to the active site.

As might be expected, predicted SDPs tend to be more sparsely distributed in the structure, compared to the more compact distribution of CPs. The best clusters are tightest, which is natural from their construction procedure, though the minimal distance suggests they are sometimes further away from the active site (even if the average distance is similar to CPs). We discuss these observations in more detail below.

a. The diverse dataset: protein families with at least two distinct EC numbers

Application of all the filters described in the Methods section yielded 26 Pfam families (Table 1). SDPsite was applied in different ways, either ignoring SDPs thus mimicking the standard, conservation-based approaches, or including them when constructing the best cluster. For the inclusion of SDPs, we either gave them twice the weight as the CPs (λ = 0.5, λ being the relative weight of a CP to an SDP) or same weight (λ = 1). (Fig. 1). For details on the choice of the λ parameter, see Methods. In all but one of the considered families (Carboxylesterase, see below) at least one predictor performs well, and in the Asparaginase_2 family the average distance is slightly higher than 10 Å. This means that the best clusters are located in enzyme catalytic sites, and some of residues are in direct contact with the ligand. This result is significant (p < 0.01) for all but four, one of which is the Carboxylesterase family; for the other three the best cluster contains positions accounting for intersubunit contacts. The resulting implications for quaternary structure are discussed below.

Table 1 Statistics of the benchmark datasets: diverse dataset, two or more EC numbers per family
Figure 1
figure 1

Assessment of the prediction quality for the diverse dataset. In each plot, the green and the blue bars represent SDPsite predictions with λ = 0.5 and λ = 1, respectively. Yellow bars represent prediction based solely on conserved positions. (a) Minimal distance from the best cluster to the bound ligand. (b) Average distance from residues of the best cluster to the bound ligand. (c) Significance of the average distance.

Average values for all the above measures are given in Tables 2 and 3. Note that SDPs contribute to identification of the active site, although leading to prediction of a more disperse cluster. When no SDPs are predicted (last column), the average distance is smaller, because CPs form a more compact cluster in the active pocket. Unlike other methods that attempt to predict functional sites solely using the conservation of surface residues, SDPsite predicts a number of additional positions of potential importance. In 11 out of 26 families the best cluster is significantly (p-value < 0.01 in a Mann-Whitney test) closer to the ligand in both λ = 0.5 and λ = 1 scenarios, and in 20 out of 26 in at least one of them, whereas the CPs-only based prediction succeeded in only 15 families (Table 4). This indicates that the inclusion of SDPs in the prediction often leads to a significant improvement.

Table 2 Averages for the best cluster over all families from the Datasets, Å.
Table 3 Sensitivity and false positive rate over all families from the Datasets, Å.
Table 4 Assessment of SDPsite versions, diverse dataset.

For instance, in the Protein-L-isoaspartate(D-aspartate) O-methyltransferase (PCMT, PF01135) and Thioesterase (4HBT, PF03061) families, the CP-based cluster is located distant from the active site, whereas addition of SDPs rescues the prediction leading to the correct site. In the C-terminal domain of Pyridoxal-dependent decarboxylases (Orn_DAP_Arg_deC, PF00278), bacterial transferase hexapeptide (Hexapep, PF00132) and Asparaginase (Asparaginase_2, PF01112) families, SDPs rescue the cluster for λ = 0.5. In contrast, for the Shikimate kinase family (SKI, PF01202) a heavier reliance on SDPs skews the prediction, whereas more equal considerations of SDPs and CPs, or of CPs only, perform considerably better.

The Carboxylesterase (COesterase; PF00135) family is the only clear failure of the method, i.e. its active site is not identified by either variant of the method. Even the catalytic triad, Ser198, His438 and Glu197 (numbering from ChlE_Human), is not among either SDPs or CPs. The fact that the catalytic residues are not conserved in the alignment is puzzling. This could be because the alignment from Pfam contains many uncharacterized paralogs from C. elegans and D. melanogaster, which could perform a different function or be non-functional. Indeed, catalytic residues are often substituted in these proteins: Ser198 to alanine, asparagine, glycine or valine, His438 to asparagine, glutamic acid, leucine, lysine, tyrosine or valine, and Glu197 to asparagine, glutamic acid, glutamine, histidine, isoleucine, proline, threonine, tryptophane or tyrosine. Such changes mean that these residues are ignored in the prediction procedure, and highlights the need for some caution when building alignments to predict function.

We overlook details of quaternary structure when making predictions, and this can have interesting consequences, as discussed previously (e.g. ref. [25]). For instance, for the Thiolase N-terminal domain (Thiolase_N, PF00108) family, we found the minimal and the average distance to be rather large. From the structure of a protein from this family (biosynthetic thiolase from Z. ramigera, 1NL7), it is evident that the best cluster is located on the contact interface between two subunits of a dimer. Indeed, the minimal and the average distance to the second subunit are 2.73 Å and 6.90 Å, respectively. The second best cluster is, however, in the active pocket with the distances below the threshold. The family of Gcn5-related acetyltransferases (Acetyltransf_1, PF00583) is a similar case: for λ = 0.5 the best cluster is located on subunit contact interface and the second best in the active site, for λ = 1 vice versa. This highlights the need to consider quaternary structure explicitly when making and interpreting predictions using this or similar approaches.

A natural question is how well the predicted grouping of the sequences agrees with the EC numbers of the proteins considered. For most families there was a good agreement, with EC numbers segregating into discrete branches of the trees derived from the alignments. There were two families where proteins with one EC number would split between two groups that contain proteins with other EC numbers (alpha-amylase, PF00128, and polyprenyl synthetase families, PF00348). For both, the same enzymatic activity seems to evolve independently on two separate branches of the phylogenetic tree.

b. The homogeneous dataset: protein families with a single EC number

The 18 families with a single EC number are listed in Table 5. Again, we applied SDPsite with λ = 0.5, λ = 1 and without prediction of SDPs (Fig. 2). For all studies families, except Eukaryotic phosphomannomutases (PMM, PF03332), at least one of these variants puts the best cluster to the active site of the enzyme according to the described criteria. For 9 out of 18, the best cluster identified by either procedure is located in the active site. These results are significant (p < 0.01) for all families, except Adenylylsulphate kinases (discussed below).

Table 5 Statistics of the benchmark datasets: homogeneous dataset, strictly one EC number per family
Figure 2
figure 2

Assessment of the prediction quality for the homogeneous dataset. Color code as in Fig. 1. (a) Minimal distance from the best cluster to the bound ligand. (b) Average distance from residues of the best cluster to the bound ligand. (c) Significance of the average distance.

The remaining nine families, for which active sites were not identified by all variants of SDPsite, can be split into four categories: (1) λ = 0.5 fails (Thymidylate synthases, Thymidylat_synt, PF00303; thymidine kinases from Herpesviridae, Herpes_TK, PF00693); (2) both λ = 0.5 and λ = 1 fail (GTP cyclohydrolases II, GTP_cyclohydro2, PF00925; Phosphoenolpyruvate carboxykinases, PEPCK_ATP, PF01293; Adenylylsulphate kinases, APS_kinase, PF01583); (3) CP-only prediction fails (Queuine tRNA-ribosyltransferases, TGT, PF01702; Pyruvate formate lyases, PFL, PF02901); and (4) at least two of the three fail (oxygenase domain of Nitric oxide synthases, NO_synthase, PF02898; Eukaryotic phosphomannomutases, PMM, PF03332).

SDPs are expected to perform worse for this dataset, because in theory there are no specificity differences within each family as defined by the EC system. However, CP-based clusters significantly skew the predictions in two cases out of eighteen. One explanation of this observation could be inaccuracy or ambiguity in the assignment of the EC numbers, which would mean that we have identified some real differences the specificity. It might also be that a substitution to an amino acid with similar properties (e.g. size, polarity, charge) occurred, which can perform an equivalent enzymatic role. Another possibility is the functional convergence to a single specificity at the molecular level: SDPs indicate different residues that distant species or distant paralogs evolved to perform the same function (e.g. [44]). Alternatively, as discussed below, SDPs may indicate differences specific to certain phylogenetic clades.

Similarly to the diverse dataset, there are two families for which the best cluster is located on the subunit contact interface: Thymidylate synthases and Adenylylsulphate kinases. The minimal and average distances to the second subunit of the dimer are 3.30 Å and 5.73 Å for Thymidylate synthases and 2.65 Å and 6.55 Å for Adenylylsulphate kinases. For both, the second best cluster is in the active site.

For two families, all three versions of SDPsite produce poor results. For the oxygenase domain of Nitric oxide synthases the average distance of the residues from the best cluster to the ligand is between 10 and 15 Å for all three versions, and for Eukaryotic phosphomannomutases it even exceed 15 Å in two out of three versions. However, for both families, the second best cluster is located in the active site. In the case of the oxygenase domain of Nitric oxide synthases the alignment contains only 8 sequences, which split into phylogenetic groups of insects and vertebrates. These groups are too distant, and it seems that the method cannot remove the phylogenetic trace completely, thus producing many SDPs that are probably not functionally important. A more representative set of sequences may improve the predictions and probably place the best cluster in the right position. For the Eukaryotic phosphomannomutases, the best cluster is located in the core domain close to the magnesium ion, which is a part of the active site, but distant from the substrate in the open conformation of the protein.

The means of the minimal and the average distances to the ligand and the significance of the average distance are summarized in Table 2. It is not surprising that giving SDPs and CPs equal weight (λ = 1) leads to better results than λ = 0.5 for this dataset: since all the proteins have the same EC number, one might not expect to find anything accounting for differences in specificity among these proteins. As discussed above, the identified SDPs could, instead, reflect different ways that different groups of proteins evolved to perform the same function. Again, building the best cluster from only CPs does not improve the prediction quality, and all the distances used as the performance measures are larger than for λ = 1. This is another indication that proteins, even though they have the same function and specificity, have evolved quite different ways to perform it, and this must be taken into account when predicting functional sites.

In contrast to the diverse dataset, the success rate is similar for different versions of SDPsite (Table 6). In 10 out of 18 families the best cluster identified by taking into account both CPs and SDPs is significantly closer to the ligand (p-value < 0.01 in Mann-Whitney test), which indicates that specificity determinants predicted for homogeneous families, even when they do not illuminate the binding of a specific ligand, play some other important role in their function. In this regard, it is important to stress that this dataset is only homogenous as defined by EC numbers, and it is well established that these do not always uniquely define molecular function [45]. In other words, substantial sequence and structure diversity is possible even among sets of protein sharing the same EC number.

Table 6 Assessment of SDPsite versions, homogeneous dataset.

For example, the alignment of the queuine tRNA-ribosyltransferase family contains 9 bacterial and 4 archaeal proteins. When predicting SDPs, SDPsite clearly divides the family into these two groups. A closer analysis of the predicted SDPs reveals that some positions that are conserved within the two groups but differ between them bind substrate or tRNA in the bacterial enzyme from Zymomonas mobilis [46]. For example Cys 158 binds queuine precursor (substituted to proline in archaea), and Arg 286 binds tRNA (substituted to leucine).

c. Overall performance in the benchmark and guidelines for predictions

We analyzed the performance of SDPsite by calculating, for the diverse, homogenous and combined datasets, sensitivity (the ratio of number of true positives to the number of true positives plus false negatives) and false positive rate (ratio of number of false positives to the number of false positives plus true negatives) (Table 3). For a perfect prediction, sensitivity should be close to 1 and the false positive rate close to 0. As a gold standard set of residues in active sites, we considered all amino acids located within a distance of 10 Å from the bound ligand. Note that not all these residues are functionally important, which makes the reported false positive rate lower than it would be if we had perfect information on real functional importance of all residues in the proteins considered.

In all datasets the inclusion of SDPs in the predictions leads to a higher sensitivity. There is no significant difference between the diverse and homogeneous datasets, though this is probably due to the diverse nature of the underlying EC and sequence data as mentioned above: functional diversity is also likely present in the homogeneous dataset making SDPs beneficial to the prediction of functional sites.

As discussed above for the phosphomannomutase family, proteins often undergo conformational changes upon binding ligands, which means that any method tested on structures in complex with a ligand might unfairly profit from the use of a bound structure, which will differ from the unbound or apo form of the protein. To test for this effect, we identified 16 apo protein structures (of 26) from the diverse dataset and 14 (of 18) from the homogeneous dataset and found no significant difference in the predicted residues. This effectively means that the method is relatively insensitive to minor conformational changes that occur upon binding. It is worth noting, however, that the enzyme sites considered here might not be representative of other types of interactions that can lead to more substantial conformational re-arrangements (e.g. protein-protein interactions), or contain much larger functional sites than the tight constellations of functional residues normally found in enzymes.

The results from the above benchmark provides a guide for how to interpret predictions, which we used for the unannotated families below. Based on inspection of the results, we found the ratio of the total number of predicted SDPs and CPs to the length (i.e. (SDP+CP)/length) of alignment to be a useful measure of performance. For 29 of 44 families this ratio was below 0.2. For 17 of these 29 families all three versions of SDPsite predict the best cluster with an average distance to ligand less than 10 Å, and for 8 of these the p-value is < 0.01 in all three versions of the method. However, the set of families with successful predictions is not enriched with those with low (SDP+CP)/length ratio compared to all predictions. We also calculated the ratio of the number of residues in the best cluster to the length (best cluster/length) of alignment and applied the cutoff of 0.1. For 35 of total 44 (80%) families the best cluster/length ratio is below 0.1, whereas among families, for which average distance is <10 Å, and p-value < 0.01 for all three versions of SDPsite, this fraction is 10 out of 12 (83%). Thus it is practical first to consider predictions with a low best cluster/length. Still, as we discuss below, even more disperse predictions can lead to interesting insights.

As noted above, sometimes the best cluster can be situated on the inter-subunit contact interface, and the second best cluster in the active site. No clear strategy can be suggested to distinguish these situations in the absence of prior knowledge. In practice one should analyze the composition of the best cluster for the presence of amino acids typical for enzyme active sites (potential electron donors/acceptors) or for protein-protein contact interface (hydrophobic patches, etc.). These considerations, however, tend to be family-specific and thus cannot be included into a prediction algorithm intended for a general use.

It is worth noting that SDPsite performs best when the alignment is sufficiently long, the sequences are sufficiently diverse in terms of sequence identity and the phylogenetic tree is biologically sound. As a guideline, it performs best with alignments of at least 50 positions, and at least 10 proteins with identities between 30 and 90%. The diversity of possible sequences sets and functions for which SDPsite is applicable is great. From our benchmark, we were unable to distinguish between a single-EC and a multiple-EC family in a blind test. This is probably the major drawback of the method: in absence of additional information, one cannot conclude whether the identified SDPs pinpoint the real functional diversity within the family, or simply reflect the phylogenetic trace. It is impossible to say which of the three functional site prediction approaches is generally best. In practice, one should always run all three approaches, and then interpret based on what, if anything, is known about the function, and how the functional sites look on the structure.

2. Application to protein families lacking functional annotation

We focused on 193 Pfam families that included at least one protein with resolved 3D structure, and where all structures come from Structural Genomics Projects. After removing families with fewer than 6 sequences in the Pfam seed alignment, we were left with 124 (Additional file 2), for which potential functionally important sites could be identified. Of these, 54 families include potential or proven enzymes; 5 are transcription factors; 3 are involved in translation; 15 participate in various cellular processes in a fashion that is not completely understood; and the functional role for 47 is unknown. The full description of the predictions is available from the SDPsite web-site. We calculated SDP+CP/length and best cluster/length ratios for these families. For 50 of 124 families SDP+CP/length ratio was below 0.2 and the predicted cluster lies on the surface or in a pocket of the protein structure. For another 44 families predictions were weaker, but acceptable: SDP+CP/length ratio exceeds 0.2, but the cluster still lies in a pocket or on the surface. For the remaining 30, no reasonable predictions were made. Taken together, 76% of the predictions seem to provide a reasonable hint about the location of the actual functional site of the protein. The best cluster/length ratio is below 0.1 in 91 of 124 (73%) families. This fraction is lower than for the benchmark dataset, which can be due to the fact that the considered dataset includes proteins of various functions as opposed to the benchmark that consists solely of enzymes. The predictions made for these 91 families are the most promising candidates for experimental validation. Five promising examples are discussed below. These examples were chosen according to the criteria above, and our own visual inspection of the results.

a. YCII-related domain (PF03795)

This family was first identified during an analysis of the Streptomyces coelicolor genome by a domain hunting process [47]. The authors confined the annotation to a remark that it is "probably enzymatic". The study of this domain continued, when the first (and to date only) structure of a protein from this family, HI0828 from Haemophilus influenzae, was solved (PDB ID 1mwq, [48]). The protein was shown to adopt a ferrodoxin-like α/β-fold, and a catalytic mechanism involving a histidine-aspartate pair was proposed. However, biochemical assays have to-date failed to suggest a substrate for the enzyme.

SDPsite identifies a small cluster of potentially important residues (Fig. 3a), located in a pocket, which contains a coordinated ZnCl3, a ligand supposed to play a role in the catalysis. The method splits the family into three specificity groups (Fig. 3b). As the family includes a number of paralogues from the same species, which fall into different specificity groups, the identified sub-groups might reflect the real specificity differences within the family. The predicted SDPs might thus account for binding of different substrates in the active pocket.

Figure 3
figure 3

A. Structure of HI0828 from Haemophilus influenzae (1 mwq). SDPs are marked yellow, CPs are marked orange, best cluster is shown in spheres. Cl ions are shown in green, Zn ions in brown. B. Phylogenetic tree of the YCII-related domain (PF03795) family. The predicted specificity groups are shown as gray ovals.

b. CobW/HypB/UreG, nucleotide-binding domain (PF02492), and Cobalamin synthesis protein cobW C-terminal domain (PF07683)

These two protein families represent the two domains of CobW protein, a hypothetical protein, involved in cobalamin biosynthesis and possibly required to store cobalt ions as an intermediary between the cobalt transport and chelation systems [49]. The N-terminal part of CobW is a member of PF02492, which also contains nucleotide-binding domains of HypB and UreG, GTPases involved in binding of nickel to apoenzymes [50, 51]. The only known 3D structure in both families belongs to a hypothetical protein YjiA from E.coli [31] (PDB ID 1nij). YijA is a homolog of CobW and has the same domain structure, but a much shorter linker between the domains, and also lacks the histidines required for metal binding, and thus cannot serve as a metal repository. YijA is believed to be a GTP-dependent regulator, however its biological role is unclear [52]. The N-terminal domain of YjiA (which corresponds to PF02492) has a fold typical to P-loop NTP-binding proteins, and a number of motifs responsible for GTP binding can be found in it. The C-terminal domain has a ferrodoxin-like fold.

SDPsite identified two potential functionally important sites, one for each domain (Fig. 4). Both clusters are exposed to solvent and the cluster in the N-terminal domain is located close to the possible nucleotide-binding pocket, identified by Walker A and B motifs, as discussed in ref. [49] (arrow in Fig. 4). The arrangement of the clusters suggests their possible role in communication between the two domains. Indeed, in the structure used the minimal distance between residues of the N-terminal and the C-terminal domain cluster is 5.5 Å. As the domains are rather mobile relative to each other due to the flexible linker, under certain conditions the two clusters could potentially interact directly.

Figure 4
figure 4

Structure of YjiA from E.coli(1nij). N-terminal domain is shown in pink, linker is shown in light green, C-terminal domain is shown in light blue. SDPs are marked yellow, CPs are marked orange in the N-terminal domain and cyan and magenta, respectively, in the C-terminal domain, best cluster is shown in spheres. The red arrow indicates the position of the nucleotide-binding pocket.

c. PHP domain (PF02811)

This family corresponds to a catalytic domain with a phosphoesterase activity found both as a stand-alone protein and fused to DNA polymerase domains. PHP domains are often found in the N-terminal part of bacterial DNA polymerase III α subunit and in the C-terminal part of DNA polymerase of the X family in some archaea. In this role, the PHP domain is proposed to hydrolyze pyrophosphate, shifting reaction equilibrium to polymerization [53]. The family also includes a number of tyrosine-protein phosphotases and histidinol phosphatases and many uncharacterized proteins. 3D structures are available for two proteins of the family, both with unknown function: YcdX from E.coli and tm0559 from Thermotoga maritima.

We mapped the predicted positions on the structure of YcdX from E.coli [54] (PDB code 1m68) (Fig. 5), since the predicted clusters in all available structures are formed by equivalent residues (data not shown). YcdX is usually present as a trimer in solution, each monomer possessing its own catalytic site [54] (Fig. 5A, B and 5C show a closer view of one monomer of the complex). A cluster of three zinc ions is located in a cleft of the structure, indicating a possible location of the active site (marked with an arrow in Fig. 5A). The predicted cluster of functionally important residues lies close to the zinc cluster and has a layered form: a more compact layer of CPs and a fuzzier layer of SDPs. CPs might represent the catalytic core of the active site and SDPs, a less spatially defined recognition periphery. Some SDPs even protrude to the back of the monomer, where they can participate in the ligand recognition in another active site.

Figure 5
figure 5

Structure of YcdX from E.coli(1m68). A. YcdX trimer. Putative location of the active site is indicated with an arrow. SDPs are marked in yellow, CPs in orange, the best cluster is shown in spheres. B. YcdX monomer. Side view. C. YcdX monomer. Front view.

It was previously noted [53] that proteins of the PHP family appear as single domain proteins or as domains in multi-domain proteins involved in a variety of cellular processes, possibly exhibiting diverse specificity. Indeed, one can find proteins annotated as "DNA polymerase III subunit alpha", "DNA-dependent DNA polymerase family X", "Histidinol-phosphatase", "Tyrosine-protein phosphatase" and "Protein trpH" in this family. Many of the family members are still uncharacterized. Along with the identification of a cluster of possible functionally important residues, SDPsite extracts a number of specificity groups, which are shown on the phylogenetic tree on the family in Fig. 6. Proteins with similar annotation always fall into same group, and, with a certain degree of caution, this annotation can be transferred onto other proteins of the same group. For example, YcdX falls into the same group with a sequence from Methanobacterium thermoautotrophicum (O26650), which is annotated as a DNA-dependent DNA polymerase family X, known to be involved in DNA repair. Analysis of gene expression level in stress conditions indicates that YcdX might also be involved in DNA repair [55].

Figure 6
figure 6

Phylogenetic tree of the PHP domain (PF02811) family. Specificity groups are shown as gray ovals. The predominant annotation for certain groups is indicated beside them. When only one protein of a group has a functional annotation, it is put in italics and indicated by an arrow.

d. Possible lysine decarboxilase (PF03641)

According to Pfam, this family includes proteins annotated as lysine decarboxylases, but the evidence for this annotation is not clear. Three structures from this family are available (YvdD from Bacillus subtilis, 1t35; tm1055 from Thermotoga maritima, 1rcu; and at5g11950 from Arabidopsis thaliana, 1ydh), none of them with clear functional annotation. One can clearly see a cleft lined with CPs with SDPs on its periphery (Fig. 7A, B and 7C show different possible oligomeric states), which could form an active pocket. It is also possible that there is one active pocket per every two subunits (see Fig. 7C).

Figure 7
figure 7

Structure of YvdD from Bacillus subtilis(1t35). A. YvdD octamer. Suggested biological unit (from PDB entry 1t35). The color coding is as for Fig. 5. B. YvdD monomer. C. YvdD dimer. Note that the active site might be located between the two subunits.

e. YqeY-like proteins (PF09424)

YqeY-like proteins are widespread, but the largest portion of the family comes from bacteria (341 sequences vs. 17 in Eukarya, 1 in viruses and none in Archaea). Predominantly, there is only one copy per genome. Despite the great number of proteins in the family, their function remains unclear. A number of SDPs and CPs were identified in this protein (Fig. 8). The best cluster is a tight cluster of four SDPs at the C-terminal end of the second α-helix, and exposed to solvent. This positioning might suggest its involvement in some protein-protein interactions. The functional role of these residues, if any, remains to be tested.

Figure 8
figure 8

Structure of YqeY from Bacillus subtilis(1ng6). The color coding is as for Fig. 5.

3. Comparison to other functional site prediction methods

We compared the performance of SDPsite to several other techniques which also aim to predict functionally important residues from protein sequence and structure information: 1) the Evolutionary Trace (ET) method [11, 12], 2) the ConSurf method [9], and 3) the three methods from the TreeDet package [30]. To produce predictions with the ET method, we used the TraceSuite II server For ConSurf we used the Rate4site implementation [56], which assigns an evolutionary rate to each position of the alignment. The more a position is conserved, the more important for the function it is considered to be. As the algorithm produces a rating, rather than a set of functionally important positions, we considered the two most conserved of the nine output categories as the prediction.

We used the two benchmark datasets and the assessment procedure described in the above to assess the predictions of different methods. The minimal distance to the ligand, the average distance and the significance of the average distance are presented in Additional file 3. The predictions of the ET method usually show low minimal distance to the ligand, but due to the fact that ET renders too many positions as potentially important (up to 60% of the alignment length), which is impractical in experimental studies. We performed Mann-Whitney tests comparing the distribution of distances for the predictions produced by SDPsite to various other methods (Table 7). In 32 and 34 of 44 cases at least one of the three versions of the method (λ = 0.5, λ = 1 or no SDPs) performed significantly better (p-value < 0.01) than ET. and Rate4Site, respectively. FASS, MB and S methods perform well, but frequently identify too few amino acids to make the test reliable, or are not applicable at all due to the methods' limitations. SDPsite thus achieves a good balance between the quality and quantity of the predicted residues and the general applicability of the method to further investigations.

Table 7 Assessment of different methods.


We have presented a new approach for automated identification of functional sites in protein structures, SDPsite. Its main advantage is that it considers not only conservation, but also specific differences among proteins from a protein family. A number of approaches aim to predict specificity determinants [2541] and/or conserved functionally important sites [924], but they almost never combine structural and sequence information. The closest attempt to apply this kind of analysis to prediction of functional sites was the Evolutionary Trace (ET) method [11, 12]. However, the ET approach takes into account all positions conserved within groups that arise when cutting the phylogenetic tree of the family at a certain distance from the root, thus often leading to too many predicted positions. SDPsite also extracts all possible groups of proteins corresponding to different identity cutoffs, but, unlike ET, it selects the least probable to arise by chance, i.e. the most probable functional grouping.

We tested SDPsite on two benchmark datasets of enzyme families, for which details of function and binding mode of ligands were known, and specificity of proteins was, in one case, differing within the family, and in the other case, same for all proteins of the family. The tests show that for almost all cases SDPsite succeeds to predict the functional site (i.e. the active cleft of the enzyme) with a good accuracy. The average values of the minimal and average distances from the best cluster to the ligand over all families (Table 2) demonstrate that the predicted best cluster is compactly situated in the active site of the enzymes in most families. As noted above, the prediction of specificity determinants is the main novel feature of our method, but ignoring them and using only conserved positions to build the best cluster can lead to a decrease of the sensitivity of the method (Additional file 1, Table 3). Using this strategy some residues crucial for the function are missed (e.g., substrate-binding residues in tRNA-ribosyltransferases, see Results). Thus clusters that include SDPs often define a more practically useful set of functional residues. Comparisons to other methods for functional site predictions show that SDPsite is marginally or substantially better than the previous approaches, supporting the notion that the inclusion of specificity determining residues generally augments predictions of functional sites.

We provide predictions of functional sites for a large number of proteins, whose structures were solved in a high-throughput manner and lack functional annotation. In some cases (e.g. YCII-related domain, PHP domain, see Results section) we provide evidence that a family of homologous proteins includes proteins of different specificity and suggest composition of specificity groups. When specificity of some members of these groups is known, one can transfer it, with caution, to uncharacterized proteins, assisting their annotation (see the example of PHP domain).


We have introduced a novel approach to identification of functionally important sites in protein structures. It bridges the gap between two existing categories of methods dealing with the same problem: sequence-centric comparative techniques and the structure-centric techniques that incorporate some physico-chemical considerations. Sequence data are already overwhelming, and structures lacking functional annotation are likely to increase in number for many years to come. Thus practical tools like that described will be of great benefit for those interested in inferring function directly from structure.


SDPsite algorithm

The approach takes as input the multiple sequence alignment of a protein family, the corresponding phylogenetic tree and a 3D structure of one of the family members. The algorithm is composed of three basic steps: (1) prediction of specificity determining positions (SDPs) (at this step, the sequences are automatically divided into a number of specificity groups, i.e. groups of proteins with presumably the same specificity); (2) prediction of conserved positions (CPs); and (3) identification of the best spatial cluster.

Prediction of specificity determining positions

We previously developed an algorithm for prediction of specificity determining positions (SDPs) [25]. Given a set of protein sequences divided into a number of specificity groups, this method searches for the positions of the alignment, for which the distribution of amino acids is best correlated with the specificity groups. We suppose that proteins from the same specificity group have the same specificity, and specificity differs for proteins from different groups. Thus, the identified positions are likely to confer the specific differences in the function.

To assess the correlation, we use mutual information of each position p of the alignment:

where f p (α, i) is the frequency of amino acid α in position p in specificity group i, f p (α)is frequency of the amino acid α in position p in the whole alignment, f(i) is the fraction of group i in the whole alignment. To account for the biological nature of the data, we introduce a number of corrections, which we discuss in detail in [25]. Then we calculate the mean and the variance of the expected mutual information and the statistical significance (Z-scores) for each position:

Then, to identify the number of highest-scoring positions, which are most likely to be SDPs, for each k positions with highest Z-scores, we compute the probability of getting this many Z-scores greater or equal to the smallest of the k by chance, assuming the normal distribution of the Z-scores. k* positions, for which this probability is the lowest, are the predicted SDPs:


The probability of these k* SDPs, P* = P{there are at least k cases: ZZ(k)}, is called the statistical significance of the set of k* SDPs.

Determining specificity groups for a multiple sequence alignment

To allow large-scale application of SDPsite, we developed a procedure for automated identification of specificity groups from a multiple sequence alignment of the family alone. We applied the idea from Lichtarge and colleagues [11]: we place the root of the phylogenetic tree in the midpoint of the longest path between two leaves and consider the sets of groupings derived from cutting the tree on different distances from the root (Fig. 9). We ignore groups with fewer than three sequences. Then for each grouping we compute the SDPs as described above and select the grouping with the lowest P*, i.e. the least probable to arise by chance set of SDPs.

Figure 9
figure 9

Automated grouping procedure. Two possible groupings are shown in red and in blue.

We need to correct for the fact that larger groups tend to produce larger Z-scores. When considering the same alignment and increasing the number of sequences in each group, Z-scores can be approximated by a log function (data not shown). Thus we divide Z-scores by the logarithm of the average number of sequences in a specificity group.

Prediction of conserved positions

For each position p of the alignment, we assessed its conservation using the Sander-Schneider conservation measure [57]:

where N is the number of sequences in the alignment, d(s i , s j ) is the distance between sequences s i and s j , equal to , s i (p) and s j (j) is the amino acid in position p in sequence, s i and s j respectively, M(α, β) is the BLOSUM62 amino acid substitution matrix.

Analogous to SDPs, we calculate Z-score for each position:

is the background distribution generated by 10000 times picking random positions from the alignment to form the random alignment column. This corresponds to conservation in a set of unaligned sequences. Since aligning itself introduces some bias, we correct for it by centering Z-scores: . Then we select the least probable set of CPs using procedures described in the previous section.

Construction of the best cluster

To compute SDPs and CPs we do not need the 3D structure of proteins of the family. We make use of it when constructing the best cluster for the first time. When there are resolved 3D structures for more than one protein of the family, the resulting best cluster may differ depending on the structure, but our tests show that it is usually not the case (data not shown).

We map the predicted SDPs and CPs onto the structure and construct the best cluster using the layered clustering based on tightness of set function algorithm [58] as follows. Consider a graph where vertices correspond to the predicted SDPs and CPs, and an edge connects every pair of vertices. Let H0 be the union of all vertices of the graph. Then for each i we calculate its weight: , where the sum is over all other vertices j from H0, ω ij is the weight of the edge connecting i and j, calculated as:

where d ij is the Euclidian distance between closest atoms of amino acids corresponding to vertices i and j, R = 5 Å is the average distance between centers of atoms in contact, D = 15 Å is the influence distance of the atom. R and D are empirical constants. λ i = 0.5 if the vertex corresponds to a CP, 1 otherwise. Thus CPs get twice as little weight as SDPs,. The rationale for this is the fact that in the benchmark datasets there is a clear tendency for more SDPs in the active site for families from the diverse dataset (10 SDPs on average closer than 10 Å to the ligand, opposed to 7.38 CPs). The equal measure derives from the fact that the homogeneous dataset has no such bias, (10.11 and 11.83 SDPs and CPs, respectively). For an unknown family then, it is logical to try both parameter settings.

We then find a subset F0 H0, for all vertices of which μ is minimal and equals . Let H1 = H0\F0. We repeat this until we get an empty set at some point. Thus we construct a series of layered clusters H0H1 ∟ ... ∟ H N ∟ Ψ. We select the cluster H n , for which and call it the best cluster.

In certain cases we refer to a second best cluster. It is constructed analogous to the best cluster, after all amino acids from the best cluster are excluded from the initial set.

Construction of the datasets

Benchmark datasets

To assess the performance of SDPsite, we built two benchmark datasets containing enzyme families with well-characterized function and at least one structure with bound natural ligands or their close analogs. The first dataset corresponds to the situation of a protein family, which acts on a variety of substrates, the second accounts for the case when all the proteins specificities are the same. We need these two datasets, because we do not know in advance to which of the two types the query family belongs.

We used alignments from the Pfam database [59]. The first dataset, termed diverse, consists of families, for which EC numbers differing in the last digit (usually accounting for the enzyme specificity) were assigned to at least two proteins. The second dataset was termed homogeneous and includes families, where all the proteins have the same EC number. The diverse dataset mimics the situation when the family in question contains proteins that bind different ligands (e.g. hydrolases of different specificity); the homogeneous dataset approximates sets of proteins that perform a single specific function in the cell (e.g. DNA polymerase), though (see Results) EC number is not necessarily sufficient to do this. The statistics and the bound ligands for these dataset are given in Tables 1 and 5.

To evaluate the performance of SDPsite for these datasets, we developed several statistical measures. Let d(i, j) be the distance between the closest atoms of amino acids i and j (j can represent the bound ligand or the set of bound ligands as well), ψ be a set of predicted residues, either all SDPs, all CPs, or the best cluster, n(Ψ) be the number of elements in Ψ. We calculate:

  1. 1)

    Minimal distance to the ligand: ;

  2. 2)

    Average distance to the ligand: ;

  3. 3)

    Diameter of the predicted set of SDPs, CPs and the best cluster: ;

  4. 4)

    Average pairwise distance between the residues within the predicted set: .

We also compute the significance of the average distance, i.e. the fraction of atoms that are located this close or closer to the ligand. This gives us an idea of which portion of the protein our prediction covers: the smaller the significance, the more precise the predictions.

Test dataset

We selected 193 Pfam families based on the following criteria: (1) the function of the family must be unknown or poorly characterized; (2) the family must include a protein with a 3D structure resolved in the framework of the Structural Genomics initiative or lack clear functional annotation; and (3) there should be no proteins with well-studied structure or function. A structure was considered as poorly annotated and coming from SG project, if the PDB file contained the words "structural genomics", "hypothetical protein" or "unknown function" in its HEADER or KEYWDS sections. Then we excluded families that consist of fewer than 6 sequences, because it not possible to predict SDPs for sets smaller than this size, and ended up with 124 families (listed in Additional file 2). All the alignments were downloaded from Pfam [59] and the corresponding phylogenetic trees from Pandit [60].

A web server to use SDPsite is available at: and The data for all predictions discussed is also available on this web site.


  1. Holm L, Sander C: Dali: a network tool for protein structure comparison. Trends Biochem Sci 1996, 20: 478–480. 10.1016/S0968-0004(00)89105-7

    Article  Google Scholar 

  2. Taylor WR, Flores TP, Orengo CA: Multiple protein structure alignment. Protein Sc 1994, 3: 1858–1870. 10.1002/pro.5560031025

    Article  CAS  Google Scholar 

  3. Shindyalov IN, Bourne PE: Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng 1998, 11: 739–747. 10.1093/protein/11.9.739

    Article  CAS  PubMed  Google Scholar 

  4. Kleywegt GJ: Recognition of spatial motifs in protein structures. J Mol Biol 1999, 285: 1887–1897. 10.1006/jmbi.1998.2393

    Article  CAS  PubMed  Google Scholar 

  5. Stark A, Russell RB: Annotation in three dimensions. PINTS: Patterns in Non-homologous Tertiary Structures. Nucleic Acids Res 2003, 31: 3341–3344. 10.1093/nar/gkg506

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  6. Porter CT, Bartlett GJ, Thornton JM: The Catalytic Site Atlas: a resource of catalytic sites and residues identified in enzymes using structural data. Nucleic Acids Res 2004, (32 Database):D129–133. 10.1093/nar/gkh028

    Google Scholar 

  7. Dodson G, Wlodawer A: Catalytic triads and their relatives. Trends Biochem Sci 1998, 23: 347–352. 10.1016/S0968-0004(98)01254-7

    Article  CAS  PubMed  Google Scholar 

  8. Stark A, Shkumatov A, Russell RB: Finding functional sites in structural genomics proteins. Structure 2005, 12: 1405–1412. 10.1016/j.str.2004.05.012

    Article  Google Scholar 

  9. Landau M, Mayrose I, Rosenberg Y, Glaser F, Martz E, Pupko T, Ben-Tal N: ConSurf 2005: the projection of evolutionary conservation scores of residues on protein structures. Nucleic Acids Res 2005, (33 Web Server):W299–302. 10.1093/nar/gki370

    Google Scholar 

  10. Aloy P, Querol E, Aviles FX, Sternberg MJ: Automated structure-based prediction of functional sites in proteins: applications to assessing the validity of inheriting protein function from homology in genome annotation and to protein docking. J Mol Biol 2001, 311: 395–408. 10.1006/jmbi.2001.4870

    Article  CAS  PubMed  Google Scholar 

  11. Lichtarge O, Bourne HR, Cohen FE: An evolutionary trace method defined binding surfaces common to protein families. J Mol Biol 1996, 257: 342–358. 10.1006/jmbi.1996.0167

    Article  CAS  PubMed  Google Scholar 

  12. Yao H, Kristensen DM, Mihalek I, Sowa ME, Shaw C, Kimmel M, Karvaki L, Lichtarge O: An accurate, sensitive, and scalable method to identify functional sites in protein structures. J Mol Biol 2003, 326: 255–261. 10.1016/S0022-2836(02)01336-0

    Article  CAS  PubMed  Google Scholar 

  13. del Sol Mesa A, Pazos F, Valencia A: Automatic methods for predicting functionally important residues. J Mol Biol 2003, 326: 1289–1302. 10.1016/S0022-2836(02)01451-1

    Article  CAS  PubMed  Google Scholar 

  14. Polacco BJ, Babbitt PC: Automated discovery of 3D motifs for protein function annotation. Bioinformatics 2006, 22: 723–730. 10.1093/bioinformatics/btk038

    Article  CAS  PubMed  Google Scholar 

  15. Capra J, Singh M: Predicting functionally important residues from sequence conservation. Bioinformatics 2007, 23: 1875–1882. 10.1093/bioinformatics/btm270

    Article  CAS  PubMed  Google Scholar 

  16. Manning JR, Jefferson ER, Barton GJ: The contrasting properties of conservation and correlated phylogeny in protein functional residue prediction. BMC Bioinformatics 2008, 9: 51. 10.1186/1471-2105-9-51

    Article  PubMed Central  PubMed  Google Scholar 

  17. Landgraf R, Xenarios I, Eisenberg D: Three-dimensional cluster analysis identifies interfaces and functional residue clusters in proteins. J Mol Biol 2001, 307: 1487–1502. 10.1006/jmbi.2001.4540

    Article  CAS  PubMed  Google Scholar 

  18. Ma B, Elkayam T, Wolfson H, Nussinov R: Protein-protein interactions: structurally conserved residues distinguish between binding sites and exposed protein surfaces. Proc Natl Acad Sci USA 2003, 100: 5772–5777. 10.1073/pnas.1030237100

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  19. Ahmad S, Keskin O, Sarai A, Nussinov R: Protein-DNA interactions: structural, thermodynamic and clustering patterns of conserved residues in DNA-binding proteins. Nucleic Acids Res 2008, 36: 5922–5932. 10.1093/nar/gkn573

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  20. Shulman-Peleg A, Shatsky M, Nussinov R, Wolfson HJ: Prediction of interacting single-stranded RNA bases by protein-binding patterns. J Mol Biol 2008, 379: 299–316. 10.1016/j.jmb.2008.03.043

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  21. Hendlich M, Rippmann F, Barnickel G: LIGSITE: automatic and efficient detection of potential small molecule-binding sites in proteins. J Mol Graph Model 1997, 15: 359–363. 10.1016/S1093-3263(98)00002-3

    Article  CAS  PubMed  Google Scholar 

  22. Laurie ATR, Jackson RM: Q-SiteFinder: an energy-based method for the prediction of protein-ligand binding sites. Bioinformatics 2005, 21: 1908–1916. 10.1093/bioinformatics/bti315

    Article  CAS  PubMed  Google Scholar 

  23. Koczyk G, Wyrwicz LS, Rychlewski L: LigProf: a simple tool for in silico prediction of ligand-binding sites. J Mol Model 2007, 13: 445–455. 10.1007/s00894-006-0165-4

    Article  CAS  PubMed  Google Scholar 

  24. Dundas J, Ouyang Z, Tseng J, Binowski A, Turpaz Y, Liang J: CASTp: computed atlas of surface topography of proteins with atructural and topographical mapping of functionally annotated residues. Nucl Acids Res 2006, 34: W116-W118. 10.1093/nar/gkl282

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  25. Kalinina OV, Mironov AA, Gelfand MS, Rakhmaninova AB: Automated selection of positions determining functional specificity of proteins by comparative analysis of orthologous groups in protein families. Protein Sci 2004, 13: 443–456. 10.1110/ps.03191704

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  26. Hannenhalli SS, Russell RB: Analysis and prediction of functional sub-types from protein sequence alignments. J Mol Biol 2000, 303: 61–76. 10.1006/jmbi.2000.4036

    Article  CAS  PubMed  Google Scholar 

  27. Mirny LA, Gelfand MS: Using orthologous and paralogous proteins to identify specificity-determining residues in bacterial transcription factors. J Mol Biol 2002, 321: 7–20. 10.1016/S0022-2836(02)00587-9

    Article  CAS  PubMed  Google Scholar 

  28. Gaucher EA, Gu X, Miyamoto MM, Benner SA: Predicting functional divergence in protein evolution by site-specific rate shifts. Trends Biochem Sci 2002, 27: 315–321. 10.1016/S0968-0004(02)02094-7

    Article  CAS  PubMed  Google Scholar 

  29. Pei J, Cai W, Kinch LN, Grishin NV: Prediction of functional specificity determinants from protein sequences using log-likelihood ratios. Bioinformatics 2006, 22: 164–171. 10.1093/bioinformatics/bti766

    Article  CAS  PubMed  Google Scholar 

  30. Carro A, Tress M, de Juan D, Pazos F, Lopez-Romero P, del Sol A, Valencia A, Rojas AM: TreeDet: a web server to explore sequence space. Nucl Acids Res 2006, 34: W110-W115. 10.1093/nar/gkl203

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  31. Donald JE, Shakhnovich EI: Predicting specificity-determining residues in two large eukaryotic transcription factor families. Nucl Acids Res 2005, 33: 4455–4465. 10.1093/nar/gki755

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  32. Chakrabarti S, Bryant SH, Panchenko AR: Functional specificity lies within the properties and evolutionary changes of amino acids. J Mol Biol 2007, 373: 801–810. 10.1016/j.jmb.2007.08.036

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  33. Ye K, Feenstra KA, Heringa J, IJzerman AP, Marchiori E: Multi-RELIEF: a method to recognize specificity determining residues from multiple sequence alignments using a machine-learning approach for feature weighting. Bioinformatics 2008, 24: 18–25. 10.1093/bioinformatics/btm537

    Article  CAS  PubMed  Google Scholar 

  34. Feenstra KA, Pirovano W, Krab K, Heringa J: Sequence harmony: detecting functional specificity from alignments. Nucl Acids Res 2007, 35: W495-W498. 10.1093/nar/gkm406

    Article  PubMed Central  PubMed  Google Scholar 

  35. Reva B, Antipin Y, Sander C: Determinants of protein function revealed by combinatorial entropy optimization. Genome Biol 2007, 8: R232. 10.1186/gb-2007-8-11-r232

    Article  PubMed Central  PubMed  Google Scholar 

  36. Wallace IM, Higgins DG: Supervised multivariate analysis of sequence groups to identify specificity determining residues. BMC Bioinformatics 2007, 8: 135. 10.1186/1471-2105-8-135

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  37. Ye K, Vriend G, IJzerman AP: Tracing evolutionary pressure. Bioinformatics 2008, 24: 908–915. 10.1093/bioinformatics/btn057

    Article  CAS  PubMed  Google Scholar 

  38. Edwards RJ, Shields DC: BADASP: predicting functional specificity in protein families using ancestral sequences. Bioinformatics 2005, 21: 4190–4191. 10.1093/bioinformatics/bti678

    Article  CAS  PubMed  Google Scholar 

  39. Mayer KM, McCorkle SR, Shanklin J: Linking enzyme sequence to function using conserved property difference locator to identify and annotate positions likely to control specific functionality. BMC Bioinformatics 2005, 6: 284. 10.1186/1471-2105-6-284

    Article  PubMed Central  PubMed  Google Scholar 

  40. Capra JA, Singh M: Characterization and prediction of residues determining protein functional specificity. Bioinformatics 2008, 24: 1473–1480. 10.1093/bioinformatics/btn214

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  41. Sankararaman S, Sjolander K: INTREPID – INformation-theoretic TREe traversal for Protein functional site IDentification. Bioinformatics 2008, 24: 2445–2452. 10.1093/bioinformatics/btn474

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  42. Kalinina OV, Russell RB, Rakhmaninova AB, Gelfand MS: Computational method for prediction of protein functional sites using specificity determinants. Mol Biol (Mosk) 2007, 41: 137–147. 10.1134/S0026893307010189

    Article  CAS  Google Scholar 

  43. Chandonia J-M, Brenner SE: The impact of structural genomics: expectations and outcomes. Science 2006, 311: 347–351. 10.1126/science.1121018

    Article  CAS  PubMed  Google Scholar 

  44. Wu G, Fiser A, ter Kuile B, Sali A, Müller M: Convergent evolution of Trichomonas vaginalis lactate dehydrogenase from malate dehydrogenase. Proc Natl Acad Sci USA 1999, 96: 6285–6290. 10.1073/pnas.96.11.6285

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  45. Todd AE, Orengo CA, Thornton JM: Plasticity of enzyme active sites. Trends Biochem Sci 2002, 27: 419–426. 10.1016/S0968-0004(02)02158-8

    Article  CAS  PubMed  Google Scholar 

  46. Romier C, Reuter K, Suck D, Ficner R: Crystal structure of tRNA-guanine transglycosylase: RNA modification by base exchange. EMBO J 1996, 15: 2850–2857.

    PubMed Central  CAS  PubMed  Google Scholar 

  47. Yeats C, Bentley S, Bateman A: New knowledge from old: in silico discovery of novel protein domains in Streptomyces coelicolor. BMC Microbiol 2003, 3: 3. 10.1186/1471-2180-3-3

    Article  PubMed Central  PubMed  Google Scholar 

  48. Willis MA, Song F, Zhuang Z, Krajewski W, Chalamasetty VR, Reddy P, Howard A, Dunaway-Mariano D, Herzberg O: Structure of YciI from Haemophilus influenzae (HI0828) reveals a ferredoxin-like alpha/beta-fold with a histidine/aspartate centered catalytic site. Proteins 2005, 59: 648–652. 10.1002/prot.20411

    Article  CAS  PubMed  Google Scholar 

  49. Rodionov DA, Vitreschak AG, Mironov AA, Gelfand MS: Comparative genomics of the vitamin B12 metabolism and regulation in prokaryotes. J Biol Chem 2003, 278: 41148–41159. 10.1074/jbc.M305837200

    Article  CAS  PubMed  Google Scholar 

  50. Maier T, Jacobi A, Sauter M, Böck A: The product of the hypB gene, which is required for nickel incorporation into hydrogenases, is a novel guanine nucleotide-binding protein. J Bacteriol 1993, 175: 630–635.

    PubMed Central  CAS  PubMed  Google Scholar 

  51. Zambelli B, Musiani F, Savini M, Tucker P, Ciurli S: Biochemical studies on Mycobacterium tuberculosis UreG and comparative modeling reveal structural and functional conservation among the bacterial UreG family. Biochemistry 2007, 46: 3171–3182. 10.1021/bi6024676

    Article  CAS  PubMed  Google Scholar 

  52. Khil PP, Obmolova G, Teplyakov A, Howard AJ, Gilliland GL, Camerini-Otero RD: Crystal structure of the Escherichia coli YjiA protein suggests a GTP-dependent regulatory function. Proteins 2004, 54: 371–374. 10.1002/prot.10430

    Article  CAS  PubMed  Google Scholar 

  53. Aravind L, Koonin EV: Phosphoesterase domains associated with DNA polymerases of diverse origins. Nucleic Acids Res 1998, 26: 3746–3752. 10.1093/nar/26.16.3746

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  54. Teplyakov A, Obmolova G, Khil PP, Howard AJ, Camerini-Otero RD, Gilliland GL: Crystal structure of the Escherichia coli YcdX protein reveals a trinuclear zinc active site. Proteins 2003, 51: 315–318. 10.1002/prot.10352

    Article  CAS  PubMed  Google Scholar 

  55. Khil PP, Camerini-Otero RD: Over 1000 genes are involved in the DNA damage response of Escherichia coli. Mol Microbiol 2002, 44: 89–105. 10.1046/j.1365-2958.2002.02878.x

    Article  CAS  PubMed  Google Scholar 

  56. Pupko T, Bell RE, Mayrose I, Glaser F, Ben-Tal N: Rate4Site: an algorithmic tool for the identification of functional regions in proteins by surface mapping of evolutionary determinants within their homologues. Bioinformatics 2002, 18(Suppl 1):S71–77.

    Article  PubMed  Google Scholar 

  57. Valdar WSJ: Scoring residue conservation. Proteins 2002, 48: 227–241. 10.1002/prot.10146

    Article  CAS  PubMed  Google Scholar 

  58. Mirkin B, Muchnik I: Layered clusters of tightness set functions. Appl Math Lett 2002, 15: 147–151. 10.1016/S0893-9659(01)00109-4

    Article  Google Scholar 

  59. Finn RD, Mistry J, Schuster-Böckler B, Griffiths-Jones S, Hollich V, Lassmann T, Moxon S, Marshall M, Khanna A, Durbin R, Eddy SR, Sonnhammer EL, Bateman A: Pfam: clans, web tools and services. Nucleic Acids Res 2006, (34 Database):D247–251. 10.1093/nar/gkj149

    Google Scholar 

  60. Whelan S, de Bakker PI, Quevillon E, Rodriguez N, Goldman N: PANDIT: an evolution-centric database of protein and associated nucleotide domains with inferred trees. Nucl Acids Res 2006, 34: D327-D331. 10.1093/nar/gkj087

    Article  PubMed Central  CAS  PubMed  Google Scholar 

Download references


O.V.K. was supported by the EMBO Long Term Fellowship (ALTF 2007-119). MSG was supported by grants from the Howard Hughes Medical Institute (55005610), the Russian Academy of Science (Program Molecular and Cellular Biology). The study was partly funded by the European Commission under FP6, contract LSHG-CT-2005-512028, the Russian Foundation for Basic Research (RFBR 07-04-91555).

Author information

Authors and Affiliations


Corresponding author

Correspondence to Olga V Kalinina.

Additional information

Authors' contributions

OVK performed the collection of data, wrote the program, participated in the analysis and drafted the manuscript. MSG and RBR participated in the design of the study and the analysis of results, and helped to draft the manuscript. All authors read and approved the manuscript.

Mikhail S Gelfand and Robert B Russell contributed equally to this work.

Electronic supplementary material


Additional File 1: Distance measures designed to assess the quality of the predictions of SDPsite. A detailed list of different measures based on the distance between the predicted residues, designed to assess the quality of the prediction. (DOC 512 KB)


Additional File 2: Poorly characterized Pfam families, used in the analysis. A list of all Pfam families included in the de novo prediction of functional site, grouped by function. (DOC 152 KB)


Additional File 3: Assessment of the quality of the predicions of SDPsite and other available methods. A detailed comparison of performance of different methods for prediction of functional site, using same distance measures as in Additional file 1. (DOC 214 KB)

Authors’ original submitted files for images

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and permissions

About this article

Cite this article

Kalinina, O.V., Gelfand, M.S. & Russell, R.B. Combining specificity determining and conserved residues improves functional site prediction. BMC Bioinformatics 10, 174 (2009).

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: