Functional annotation by identification of local surface similarities: a novel tool for structural genomics

Background Protein function often depends on subsets of solvent-exposed residues that may adopt a similar three-dimensional configuration in non-homologous proteins, where they have a different order and/or spacing in the sequence. Hence, functional annotation by means of sequence or fold similarity is not adequate for such cases. Results We describe a method for the function-related annotation of protein structures by means of the detection of local structural similarity with a library of annotated functional sites. An automatic procedure was used to annotate the function of local surface regions. Next, we employed a sequence-independent algorithm to exhaustively compare these functional patches with a larger collection of protein surface cavities. After tuning and validating the algorithm on a dataset of well-annotated structures, we applied it to a list of protein structures that are classified as being of unknown function in the Protein Data Bank. By this strategy, we were able to provide functional clues for proteins that do not show any significant sequence or global structural similarity with proteins in the current databases. Conclusion This method is able to spot structural similarities associated with function-related similarities, independently of sequence or fold resemblance, and is therefore a valuable tool for the functional analysis of uncharacterized proteins. Results are available at


Background
Detection of sequence or fold similarity is often used to infer the function of uncharacterized proteins. By this approach one can tentatively assign a function to approximately 45-80% of the proteins identified by the genomic projects [1,2]. However, function is mostly determined by the physical, chemical and geometric properties of the protein surfaces [3,4], and cases have been described where the same local spatial distribution of residues important for function is achieved with apparently unrelated structures and/or sequences [5]. One of the best known examples is represented by the SHD catalytic triad of serine proteinases [6][7][8]. Furthermore, surface similarities have been detected in unrelated ATP/GTP binding proteins [9,10] and in the guanine binding sites of p21Ras family GTPases or in the RNA binding site of bacterial ribonucleases [10]. By local structural comparison Hwang et al. [11] were able to infer correctly the nucleotide binding ability of an uncharacterized Methanococcus jannaschii protein.
On the other hand, similar folds can have different functions if their active sites have diverged [12][13][14][15]. As a consequence, methods relying purely on sequence and global structure comparison may lead to inaccurate function-related annotations in cases in which a few residues are responsible for the specificity of substrate interaction.
The vast majority of well-studied functions (enzymatic activities, binding abilities etc.) are encoded by a relatively small set of residues, often not contiguous in the protein sequence but organized in a conserved geometry on the protein surface that may be used as a marker for reliable functional annotation. Although exposed to the solvent, these function-related residues are often located in surface clefts or cavities [16]. Such residues define functional modules conserved in some proteins sharing a molecular function even if differing in sequence and structure. Several tools for discovering conserved three-dimensional patterns in protein structures have already been proposed [17][18][19][20]. Schmitt et al. [21] developed a clique-based method to detect functional relationships among proteins. This approach does not rely on detection of sequence or fold homology and highlights a number of non-obvious similarities among protein cavities. The algorithm, however, is computationally intensive and cannot be applied to an all-against-all analysis of protein surface regions. Binkowski and co-workers [22] recently described an approach for detecting sequence and spatial patterns of protein surfaces: the underlying algorithm is fast, but cannot identify similarities that are independent of the residue order in the compared proteins. Two related papers [23,24] describe a method for local structural similarity detection, which is of great relevance since it is able to evaluate the statistical significance of each match. This method (PINTS) has since been used to analyze protein structures from structural genomics projects [25]. Other recent papers present algorithms able to find structural motifs possibly related to a function and to use them to scan protein structure libraries [26][27][28][29][30][31].
In a previous work [32] we described the construction of a non-redundant library of annotated surface functional sites and a fast comparison algorithm able to find structural similarities independently of the residue sequence order. We report here the analysis of the results of the first all-versus-all comparison of the protein functional sites, the validation of the comparison procedure on a test dataset and its application to the annotation of a dataset composed of proteins solved in structural genomics projects. The results are available for experimental testing at http://cbm.bio.uniroma2.it/surface/structuralGenomics.html.

Functional sites comparison
We used the compendium of protein surface regions associated with molecular functional sites stored in the SURFACE database [32]. This is a collection of 1521 annotated functional regions obtained following the procedure described in Figure 1 and in the Methods section. Each patch has at least one function-related annotation, which may be the ability to bind a certain ligand, or a match with a PROSITE or ELM pattern [33,34]. Ligand-binding abilities are included among gene ontology (GO) molecular functions [35], as are many PROSITE patterns and ELM motifs. Some other PROSITE patterns correspond to short motifs that are conserved in all members of certain protein families, and that are not necessarily associated with known function-related residues. We chose to include this class of patterns in our annotation system, since they offer a quick way to verify the reliability of a match, and in many cases these motifs do contain functional residues. Hence, our annotations can be classified either as molecular functions or as protein signatures. It is worth noting that the annotation is extended to the whole patch but is also assigned to a subset of specific annotated functional residues.
In [32] the structural matches obtained from the comparison of the SURFACE library against the entire collection of surface clefts (both annotated and not annotated) were evaluated by means of the Z-score of each match length against the distribution of the match lengths for any given annotated patch. Here we perform an exhaustive analysis in order to find conditions under which a structural similarity also suggests a function-related similarity. First, only those matches which include annotated functional residues are considered, so that each structural similarity match is likely to hold a functional meaning. This step is crucial since many matches may be obtained because of general fold similarity, without an underlying functional relationship. Finding a functional match induces an annotation of at least some of the residues, and suggests reasonable hypotheses as to function (we are currently investigating how to use our approach to find novel function-related structural motifs, i.e. recurrent structural matches between proteins that cannot be explained only by fold similarity and that may imply a previously undetected functional similarity).
From the comparison of the SURFACE library against the entire collection of surface clefts, we collected a grand total of 65910 stringent matches among patch pairs, about 4.5% of which involve 6 or more residues and 4.5% involve 10 or more residues. A non-negligible number of these matches involve residue pairs whose relative distance is not conserved in the corresponding protein sequences. More interestingly, some of the matches involve residues whose sequence order and/or sequence spacing is different in the two proteins: some of these cases, which may be examples of convergent evolution, are currently under investigation. As an example, metals can interact with proteins by means of similar arrangements of residues that can be found across different folds [36][37][38]. Scanning our dataset with zinc-binding patches leads to the finding of significant matches to proteins belonging to 42 different folds and 6 different classes as defined by SCOP [39]. Different metal-binding patches lead to similar findings, even though less dramatic. Further analysis would suggest how many of these cases are associated with functional similarities as well.

Figure 1: Description of the experimental procedure. Surface functional sites are automatically located and annotated as described in Methods. Surface clefts, identified by means of SURFNET, are filtered using a volume threshold, and annotated for the binding ability or for the presence of a functional motif from the PROSITE or ELM databases. This library (the SURFACE database) is used to scan a non-redundant collection of protein structures; a semi-automated procedure is used to define conditions under which the structural similarity also implies a functional relationship. Finally, the SURFACE database is used to analyze a list of proteins with unknown function from structural genomics projects, obtaining in several cases significant similarities that could not have been spotted through sequence or fold similarity.
The fraction of matches validated (as described in the Methods section) increases markedly with the Z-score (Table 1). At lower Z-scores, the GO terms and SWISS-PROT keywords validation methods are more represented, while, for more significant matches, ability to bind the same ligands, fold similarity and co-presence of PROSITE motifs become more relevant.
The matches that cannot be structurally or functionally justified by these methods and that are characterized by a high Z-score are relatively few (see Table 1). 171 matches out of 2173 (7.9%) having a Z-score higher than 7 are not validated following the above-mentioned criteria (Table 1). Of these 171 matches, 130 can be considered as true positive matches, confirmed by the literature and by information derived from different sources and databases. The remaining 41 matches (1.9%) are not confirmed and should be tested experimentally. About 2% of the highly significant matches can therefore be considered as possible false positive hits or new annotations. Some of these cases are shown and discussed in Figure 2(a,b).
The result emerging from this validation procedure is that, using stringent parameters in the comparison step and using the Z-score as a threshold, our algorithm is reliable and able to spot local structural similarities related to functional relationships with only a few unconfirmed hits, which can be considered as false positives or as testable hypotheses.
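As an illustration of the validation step, the following minimal Python sketch checks which criteria a pair of matching proteins satisfies. It covers only the four criteria listed in the Methods (common GO term, common SwissProt keyword, same EC number, same functional annotation); the `ProteinAnnotation` class, its field names and the example annotation values are hypothetical and are not part of the SURFACE database.

```python
from dataclasses import dataclass, field


@dataclass
class ProteinAnnotation:
    """Hypothetical container for the annotations attached to one protein."""
    go_terms: set = field(default_factory=set)
    swissprot_keywords: set = field(default_factory=set)
    ec_numbers: set = field(default_factory=set)
    # Bound ligands and PROSITE/ELM pattern names annotated on the patch
    site_annotations: set = field(default_factory=set)


def validation_criteria(query, target):
    """Return the names of the criteria shared by the two matching proteins."""
    shared = {
        "GO term": query.go_terms & target.go_terms,
        "SwissProt keyword": query.swissprot_keywords & target.swissprot_keywords,
        "EC number": query.ec_numbers & target.ec_numbers,
        "functional annotation": query.site_annotations & target.site_annotations,
    }
    return [name for name, common in shared.items() if common]


def is_validated(query, target):
    """A match counts as validated when at least one criterion holds."""
    return bool(validation_criteria(query, target))


# Illustrative annotations only (not taken from any database entry)
query = ProteinAnnotation(go_terms={"GO:0005524"}, site_annotations={"ATP"})
target = ProteinAnnotation(go_terms={"GO:0005524"}, site_annotations={"GTP"})
unrelated = ProteinAnnotation(swissprot_keywords={"Hydrolase"})

print(validation_criteria(query, target))
print(is_validated(query, unrelated))
```

Because a match can satisfy more than one criterion, counting matches per criterion in this way gives row sums larger than the number of validated matches, as noted for Table 1.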
Estimating the number of false negative matches (defining a false negative match as the missed detection of structural similarity between two proteins sharing the same function) is not straightforward, because the same or a similar molecular function may be achieved in different ways using a different three-dimensional residue arrangement. We estimated the occurrence of false negatives for PROSITE-annotated patches, using the list of known true positives (for which the function encoded by the pattern is experimentally verified) that PROSITE provides for each pattern. The procedure is as follows: for all the patches annotated with a given PROSITE pattern, we collect all matches obtained by scanning the entire patch dataset with these patches, selecting only those matches having a Z-score higher than a fixed threshold. The fraction of known true positives that are not found using the pattern-annotated patches as queries (i.e. the false negatives), when retrieving only those matches having a Z-score higher than 5, is 0.3 (meaning that we are able to correctly retrieve 70% of the occurrences of PROSITE patterns in the dataset), and it rises to 0.35 when the Z-score threshold is set to 7.
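The false negative estimate described above can be sketched as follows. This is a minimal Python illustration under stated assumptions: the function name, the `(target, z_score)` representation of a match and the toy identifiers are hypothetical, and the numbers are invented for the example rather than taken from the dataset.

```python
def false_negative_fraction(known_true_positives, matches, z_threshold):
    """Fraction of known true positives NOT retrieved above a Z-score threshold.

    known_true_positives: identifiers of proteins known to carry the pattern
    matches: iterable of (target_identifier, z_score) pairs obtained by
             scanning the dataset with the pattern-annotated patches
    """
    known = set(known_true_positives)
    if not known:
        raise ValueError("no known true positives for this pattern")
    retrieved = {target for target, z in matches if z > z_threshold}
    missed = known - retrieved
    return len(missed) / len(known)


# Toy data: ten known true positives, eight matches (one to an unknown target)
known = [f"p{i}" for i in range(10)]
matches = [("p0", 9.1), ("p1", 8.0), ("p2", 7.5), ("p3", 7.2),
           ("p4", 7.1), ("p5", 7.4), ("p6", 5.5), ("x1", 6.0)]

print(false_negative_fraction(known, matches, z_threshold=5))
print(false_negative_fraction(known, matches, z_threshold=7))
```

Raising the threshold can only shrink the set of retrieved targets, so the false negative fraction never decreases as the Z-score cut-off grows, in line with the trend reported in the text.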

Benchmark cases
To further test the ability of the procedure to find known cases of functional similarities among proteins for which sequence and/or structure similarity is not significant, a number of benchmark cases were investigated (Figure 3): i) The E. coli and the S. cerevisiae chorismate mutases (PDB codes: 1ecm and 4csm, respectively), despite the very low sequence identity, share a similar fold and a similar main functional site [18,21]. The 1ecm largest patch is annotated for the oxy-bridged prephenic acid binding ability. Using this patch as a query, the highest Z-score match is found with the 4csm largest patch (Figure 3a).
iii) Metal ions can be coordinated by histidine clusters. We identified a similarity between the human tumor necrosis factor-alpha-converting enzyme (PDB code: 1bkc) Zn binding site and the E. coli peptide deformylase (PDB code: 1icj) Ni binding site, despite their sequence and fold diversity (Figure 3c). The zinc-binding patch of 1bkc shares eight residues in the same structural conformation with the nickel-binding patch of 1icj, with a Z-score of 10.66.
iv) Nucleotide binding abilities can be associated with several unrelated proteins; we detected a high-scoring match between the GTP-binding annotated patch of the human p21 ras protein (5p21) and the L. casei Hpr kinase (1jb1) that aligns eight residues with a Z-score of 9.01 (Figure 3d). These two proteins do not share any significant sequence or fold similarities.

Figure 2: Significantly matching residues on proteins sharing no structural or sequence similarity. Similarity detected comparing the SURFACE database of annotated functional sites against a list of annotated monomers (a,b) or proteins with unknown function from structural genomics projects (c,d,e,f); the annotated patch residues are colored in blue, the matching residues in red; whenever possible, the patch annotation (bound ligand or PROSITE pattern) is shown. (a) Similarity detected between the E. coli UDP-galactose 4-epimerase (PDB code 1nah) NADH-binding patch and the H. influenzae YecO methyltransferase (1im8); the NAD co-crystallized with 1nah is shown; the similarity involves 7 residues (with a Z-score of 9.06). (b) Structural similarity between the HEXOKINASES PROSITE pattern-annotated patch of the human hexokinase type I (1qha) and the bacteriophage ms2 capsid protein; additional 1qha annotated residues are shown in yellow. (c) Structural similarity detected between the B. subtilis Yqvk protein and the Wolinella succinogenes fumarate reductase cytochrome B subunit heme group binding patch. (d) Match between Hi1480 protein from Haemophilus influenzae and the bovine cytochrome Bc1 heme-binding patch. (e) Similarity between the B. subtilis protein Yqeu and the E. coli Grea transcript cleavage factor GREAB_1-annotated patch; additional pattern-annotated residues are shown in yellow. (f) Similarity between E. coli lysozyme inhibitor and two ATP-binding patches, the Rattus norvegicus 6-Phosphofructo-2-Kinase/Fructose-2,6-Bisphosphatase major patch (red) and the mouse Aaa ATPase P97 (green).
As a further test, we analyzed the flavin-adenine dinucleotide (FAD) binding pockets, known to share structural similarities with other adenine-containing nucleotide binding pockets, despite sequence and fold differences [41,42]. FAD consists of an adenosine monophosphate (AMP) linked to a flavin mononucleotide (FMN) through a pyrophosphate bond and is involved as a cofactor in many biological processes. Using the FAD-binding patch of the Zea mays polyamine oxidase (1b37) as a bait, we selected 9 prey patches with a Z-score higher than 12: 8 preys are annotated as being able to bind a FAD molecule and belong to the same SCOP fold (FAD/NAD(P)-binding domain). The remaining trapped patch is the biggest patch of the trimethylamine dehydrogenase from Methylophilus methylotrophus (1djn), an iron-sulfur flavoprotein, and it is annotated as ADP-binding. 1djn is also co-crystallized with an FMN, which is very similar to FAD, but this ligand is associated with the second largest patch of the 1djn structure. The residues that were associated by the alignment program are shown in Figure 3e. These proteins share a very low sequence similarity, which cannot be revealed using BLAST2 [43]. The ADP binding patch of the 1djn structure superposes well onto the other patches in the binding pocket (Figure 3e), but shares no evident fold similarity with them, and belongs to a different SCOP fold (the nucleotide-binding domain). When the selected structures in Figure 3f are physically superposed (finding the least-squares fitting of the matching residues), the ligands bound to these structures also turn out to be well superposed. The procedure could therefore highlight the ability to bind a subset of the FAD molecule, namely an ADP molecule in the 1djn major patch, even with very low levels of sequence and structure similarity. Using each FAD binding patch to scan the dataset, we selected only proteins for which known functional properties are consistent with the FAD or nucleotide binding ability.

Table 1: Structural matches Z-score distribution and validation. This table shows the number of structural matches (second column from the left) found as a function of the Z-score of the match. The third column from the left (labeled "validated") reports the number of matches for which at least one of the validation criteria holds. The following columns show a breakdown of the number of matches validated by each validation condition (from the fourth column on the left to the rightmost: same PROSITE pattern annotation; same binding ability; common GO term annotation; same SCOP fold; same Enzyme Classification number; sequence similarity at least 40%; common SwissProt keyword). Note that the sum of the matches validated by the different criteria for each row is higher than the total number of validated matches at that given Z-score, since some matches can satisfy more than one condition. At increasing Z-scores, the ratio of validation conditions that we consider less reliable (SwissProt keywords, GO terms) decreases, while the ratio of more reliable annotations (i.e. same binding ability, same PROSITE pattern annotation) increases.

Figure 3: Benchmark cases analysis. (a) Structural superposition of the S. cerevisiae (red) and the E. coli (blue) chorismate mutase (PDB code 4csm and 1ecm, respectively). These two patches align ten residues, with a resulting Z-score of 15.76. (b) Structural superposition of the 4-hydroxyphenylpyruvate dioxygenase (PDB code 1cjx, red), the 2,3-dihydroxybiphenyl 1,2-dioxygenase (1han, blue), catechol 2,3-dioxygenase (1mpy, green) and the methylmalonyl-CoA epimerase (1jc5, yellow). The 1han co-crystallized iron ion is shown. (c) Superposition of the tumor necrosis factor-alpha-converting enzyme (1bkc, red) and the peptide deformylase (1icj, blue). The 1icj co-crystallized nickel ion is shown. (d) Structural superposition of the human P21 ras protein (5p21, red) and HprK/P 1jb1 (blue). (e) Structural superposition of the 1b37 FAD-binding pocket (red) with the highest-score matches obtained in a database search (blue). The 1b37-bound FAD is shown. (f) Bound ligands superposition. When the three-dimensional transformation used to superpose the residues aligned in (e) is applied, the ligands that are bound to some of these proteins are superposed as well. The ADP molecule bound to the 1djn patch closely matches the ADP moiety in the similar FAD-binding pockets.

Structural genomics proteins analysis
With the stringent parameters described above, we were able to detect only matches linked to function-related similarities, even in cases of non-homologous proteins. For that reason, once proved to be reliable, the procedure can be applied as a predictive tool to obtain clues concerning the function(s) of uncharacterized proteins.
We selected 257 protein structures from the PDB, corresponding to 513 chains that are marked as being of unknown function, as hypothetical proteins, or as having been solved within a structural genomics project. We analyzed these structures by looking for reliable similarities to our functional sites library and were able to suggest one or more molecular functions for 191 of these chains, for a total of 534 similarity matches. For each match, we checked whether the previously described criteria hold (i.e. common GO term, SwissProt keyword, EC number or SURFACE annotation). If not, a literature search was carried out to verify the functional relationship. By means of this analysis of the likelihood of each single match, we found that 322 (60.3%) of these hits are validated by experimental analyses that have already characterized many of these proteins, while only 29 matches (5.4%) are not confirmed by previous findings; 107 (20%) hits involve proteins for which the functions are still unknown; 76 hits (14.2%) involve proteins for which a hypothetical function has been assigned by means of global sequence or structure similarity. In this latter case, the function-related annotation obtained from our method can be considered as a new functional annotation that corrects or improves the current function assignment. Hence, we were able to propose a function by similarity using the annotated patch database 184 times, for 127 different chains (matches with a Z-score of at least 7 are shown in Table 2). 56% of these new functional annotations concern a PROSITE pattern, the remaining 44% a ligand binding ability; this is somewhat surprising, since the majority of the patch annotations in the SURFACE library concern binding abilities. A selection of the proposed functional regions is shown in Figure 2(c,d,e,f), while the complete list can be found at http://cbm.bio.uniroma2.it/surface/structuralGenomics.html.
For each match we tested the BLAST2 pair-wise sequence similarity between the sequence of the protein to which the query patch belongs and the target protein sequence, the PsiBLAST sequence similarity matches obtained by running the target sequence versus the non-redundant SwissProt+TrEMBL sequence database, the global structural similarities of the target structure in the PDB using SSM, and the local similarity using PINTS [24]. The match with the highest Z-score (14.29 [...] genes fumarate reductase cytochrome B subunit major patch (1qlaC1), annotated with the heme group binding ability; the structural similarity involves 7 residues. The two proteins do not share any sequence or structural similarity, as checked using BLAST and the structural comparison algorithm SSM [44]. A PsiBLAST run of the Yqvk sequence against the non-redundant SwissProt+TrEMBL shows a significant similarity (E-value 4e-19) with the mouse cobalamin adenosyltransferase (SwissProt entry name MMAB_MOUSE), while the SSM comparison against the whole PDB leads to only one significant similarity, with another uncharacterized protein, the conserved protein 0546 from Thermoplasma acidophilum (1nog). A PINTS comparison [24] [...]

In some cases we found a structural similarity between a protein with unknown function and two patches annotated with the same function, giving strength to the hypothesis of function-related similarity. The conserved hypothetical protein (Tm0667) from Thermotoga maritima (PDB code 1j6o) shows a structural similarity with surface patches of E. coli nucleotidyltransferase (1gupA2) and Desulfovibrio gigas rubredoxin:oxygen oxidoreductase (1e5dA4), both annotated with the iron binding ability. The E. coli lysozyme inhibitor (1gpq), whose function is still uncharacterized, may bind ATP given the similarity to the Rattus norvegicus 6-Phosphofructo-2-Kinase/Fructose-2,6-Bisphosphatase major patch (1bif_1) and the mouse Aaa ATPase P97 (second patch (1e32A2)).

Table 2: Non-validated functional annotations of non-annotated surface patches. Functional annotated sites have been compared to a collection of surface patches extracted from a non-redundant PDB subset. The reliability of each match was estimated via a series of criteria, as described in the text. The remaining similarities may be new functional annotations of uncharacterized functional sites, or false positive matches, and are shown in this table. Columns: (i) PDB code, chain name and patch number in the annotated query patch; (ii) Description of the protein to which the query patch belongs; (iii) Query patch functional annotation; (iv) Target patch; (v) Description of the protein to which the target patch belongs; (vi) Z-score of the match; (vii) SSM Q score; (viii) SSM P score; (ix) SSM Z score. The SSM Q score takes into account the number of aligned residues, their r.m.s.d. and the size of the proteins; a high Q score means a good similarity. The SSM P score is the log of the pValue (the probability that the match occurred by chance); P scores higher than 3 are considered significant by the authors of the method.
For each described match we propose that the detected structural similarity reveals a function-related similarity.
For each match we checked whether the similarity could have been detected by means of sequence similarity, as checked using BLAST and PsiBLAST, or structural comparison, as checked by means of SSM and PINTS. Our approach, which is based on the comparison of local functional surface residues independently of their sequence order, may overcome the limitations of current methods, possibly due to our incomplete knowledge of the sequence/structure/function relationship or to convergent evolution. Even using PINTS, which is a tool similar in philosophy to our approach, the findings are different, suggesting that different tools may be complementary in the difficult task of protein functional annotation; on the other hand, this may also highlight the difficulty in evaluating the significance of local similarities that in many cases are restricted to a very small number of residues.

Conclusion
The expected burst in the number of protein structures that are not associated with a biological function, stimulated by the structural genomics programs, has emphasized the need for tools to reveal structural regularities even in proteins that do not share sequence or fold similarity [1,45]. Protein structures selected in structural genomics projects usually share very little sequence similarity with the dataset of already characterized proteins [46]. Sequence analysis tools are therefore unsuitable for inferring their functions. Moreover, cases are known where active site residues are not conserved in proteins sharing a common structural fold; therefore, "traditional" structure comparison tools are also not always able to help in function-related annotation.
Using a fully automated procedure, we obtained a reliable library of annotated protein functional sites. A fast structural comparison algorithm allows the rapid scanning of one or more protein structures with the library, looking for local structural similarities. This method is designed to help in functional annotation in difficult cases. Our method for determining and comparing annotated surface patches offers a new and powerful resource for detecting related functions among unrelated proteins, for annotating proteins solved in structural genomics projects and for identifying new function-related sites on the surface of already characterized proteins. We have been able to provide one or more functional clues for a large set of novel proteins, and, where functional evidence is already available, our findings confirm it. Moreover, just as proteins with different sequence and fold can share a similar functional site, proteins with similar sequence and/or fold can have small local differences leading to a completely different function [1,21]. Our method, which is focused on a detailed analysis of functional sites, is able to successfully predict protein functions in these difficult cases. Therefore, it can be used in analyzing the complex evolutionary relationships among protein sequence, structure and function [47][48][49]. The complete list of the functional predictions that we obtained is accessible at http://cbm.bio.uniroma2.it/surface/structuralGenomics.html; the structurally similar residues are shown for each match, and the structural superposition can be viewed through the browser plug-in Chime or RasMol. A novel publicly available web server, PdbFun [50], has been developed to allow the on-line structural comparison of user-defined subsets of residues of protein chains, and pre-defined subsets, like the SURFACE library of annotated functional sites, will be provided.

Functional site library extraction and annotation
The SURFACE database [32] stores a library of 1521 annotated function-related surface regions obtained using the following procedure (described in Figure 1): first, the SURFNET algorithm [51] is applied to a non-redundant, representative list of around 2000 protein chains from the PDB database [52] (downloadable at http://www.ncbi.nlm.nih.gov/Structure/VAST/nrpdb.html) in order to find all the surface clefts with a volume higher than an arbitrary threshold (200 Å³); then, for each cleft, a surface patch is identified as a collection of solvent-exposed residues using the MASK algorithm (which is part of the SURFNET package); finally, we infer the function of such surface patches using two kinds of annotations: ability to bind (associated with surface patch residues that are contacting a bound ligand), and match with PROSITE or ELM [33,34] functional motifs. The ability-to-bind annotation is carried out by selecting those residues within 3.5 Å of any of the atoms of a ligand found in the crystal structure. Whenever a single patch contains more than 75% of the ligand-contacting residues (62% of the cases), we assign the ligand binding ability to this surface cleft. Considering only large organic molecules and metal ions, the fraction of ligands that can be unequivocally associated with a single patch rises to 78%. PROSITE annotations are obtained by scanning the sequences of the monomers in our dataset using the ScanProsite algorithm [53], finding 928 matches. 12 matches were found with the ELM [34] experimentally verified instances. We did not consider those patterns marked by PROSITE as unspecific. Moreover, we annotated only those residues that correspond to non-X positions in the regular expression and that are exposed to the solvent according to the NACCESS procedure [54,55]. Once the dataset chains have been annotated, we map the annotated residues onto the structure and onto the surface patches. Whenever a single patch contains more than 75% of the pattern exposed residues, we assign the function encoded by this pattern to the patch (43% of the cases).
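The two annotation rules described above (the 3.5 Å contact cut-off and the 75% patch assignment rule) can be illustrated with a minimal Python sketch. The data layout (dictionaries of residue identifiers and atom coordinates), the function names and the toy coordinates are hypothetical choices made for this example; they do not reproduce the SURFNET, MASK or SURFACE code.

```python
import math

CONTACT_CUTOFF = 3.5   # Å, distance cut-off stated in the text
PATCH_FRACTION = 0.75  # a patch must hold more than this fraction


def contacting_residues(residue_atoms, ligand_atoms, cutoff=CONTACT_CUTOFF):
    """Residues with at least one atom within `cutoff` Å of any ligand atom.

    residue_atoms: dict mapping residue id -> list of (x, y, z) coordinates
    ligand_atoms:  list of (x, y, z) coordinates of the ligand atoms
    """
    contacts = set()
    for res_id, atoms in residue_atoms.items():
        if any(math.dist(a, l) <= cutoff for a in atoms for l in ligand_atoms):
            contacts.add(res_id)
    return contacts


def assign_to_patch(annotated_residues, patches, min_fraction=PATCH_FRACTION):
    """Return the id of the patch holding more than `min_fraction` of the
    annotated residues, or None when no single patch qualifies.

    patches: dict mapping patch id -> set of residue ids
    """
    if not annotated_residues:
        return None
    for patch_id, members in patches.items():
        fraction = len(annotated_residues & members) / len(annotated_residues)
        if fraction > min_fraction:
            return patch_id
    return None


# Toy coordinates: one ligand atom and six single-atom residues
ligand = [(0.0, 0.0, 1.0)]
residues = {
    "A10": [(0.0, 0.0, 0.0)], "A11": [(3.0, 0.0, 0.0)],
    "A12": [(10.0, 0.0, 0.0)], "A13": [(0.0, 3.4, 0.0)],
    "A14": [(0.0, 0.0, -2.0)], "A15": [(1.0, 1.0, 1.0)],
}
contacts = contacting_residues(residues, ligand)
print(sorted(contacts))
print(assign_to_patch(contacts, {1: {"A10", "A11", "A12", "A14", "A15"}, 2: {"A13"}}))
```

The same `assign_to_patch` rule applies unchanged to the solvent-exposed residues of a PROSITE or ELM pattern, since the text uses the same 75% threshold for both kinds of annotation.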

Structural comparison
A sequence/fold-independent algorithm was used for local surface comparison [32]. The algorithm starts from a seed match (a pair of residues in the query that can be found in the target, at the same distance and with similar physical and chemical characteristics). The structural superposition, obtained by the quaternion method [56] and assessed at each step by residue similarity and root mean square deviation (r.m.s.d.) of the matching residues, is extended by adding neighboring residues to the seed match for as long as the r.m.s.d. and residue similarity stay within user-defined thresholds (we used a similarity at least equal to 0.3 for each added pair of residues, and an average similarity at least equal to 1.2, using the Dayhoff substitution matrix [57], and 0.8 Å as maximum r.m.s.d.). We consider only structural matches that include at least a fixed fraction (50%) of functional annotated residues, to increase the likelihood that the structural match is a function-related match as well. The algorithm is very fast and explores all the combinations of similar/identical residues in a sequence-independent way. The score of the match is the number of residues that can be superposed within the defined similarity thresholds. The significance of the score is evaluated by calculating the Z-score over the score distribution of the query patch comparison with the whole dataset: for each match, the Z-score is computed as the difference between the score of the match and the average score of all the matches for the query patch, divided by the standard deviation.
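The Z-score defined above can be written as Z = (s - m) / sd, where s is the score of a match and m and sd are the mean and standard deviation of the scores of all matches for the same query patch. A minimal Python sketch follows; the function name and the toy scores are hypothetical, and the use of the population standard deviation is an assumption, since the text does not say whether the population or the sample form was used.

```python
from statistics import mean, pstdev


def match_z_scores(scores):
    """Z-score of each match score against all scores for one query patch.

    scores: number of superposed residues for every match of the query patch.
    Uses the population standard deviation (an assumption of this sketch).
    """
    mu = mean(scores)
    sigma = pstdev(scores)
    if sigma == 0:
        raise ValueError("all scores are identical; the Z-score is undefined")
    return [(s - mu) / sigma for s in scores]


# Toy score distribution for one query patch: most matches are short,
# one match superposes 11 residues
scores = [3, 3, 4, 4, 5, 11]
z = match_z_scores(scores)
for s, value in zip(scores, z):
    print(s, round(value, 2))
```

Because the Z-score is computed separately for each query patch, the same match length can be highly significant for a patch whose matches are usually short and unremarkable for a patch that routinely produces long matches.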
In order to obtain an estimate of the number of true positive matches, defining a true positive match as a structural similarity that implies also a functional similarity, we checked if the two matching proteins share also: (i) a common Gene Ontology (GO) term; (ii) a common SwissProt keyword; (iii) the same Enzyme Classification (EC) number; (iv) the same functional annotation (i.e. the binding of the same ligand or a match with the same PROSITE or ELM pattern). Gene Ontology terms search is limited to molecular function or biological process annotations linked to PDB structures from the GOA project [35]. SwissProt [58] keywords were extracted from the SwissProt entries corresponding to the DBREF field in the PDB [52] files header. If this was not available, we extracted the sequence from the order of residues in the structure, then we looked for a close homolog (sequence similarity higher than 95% using BLAST) in the SwissProt