Open Access

Functional annotation by identification of local surface similarities: a novel tool for structural genomics

  • Fabrizio Ferrè,
  • Gabriele Ausiello,
  • Andreas Zanzoni and
  • Manuela Helmer-Citterich
BMC Bioinformatics 2005, 6:194

DOI: 10.1186/1471-2105-6-194

Received: 26 January 2005

Accepted: 02 August 2005

Published: 02 August 2005

Abstract

Background

Protein function often depends on subsets of solvent-exposed residues that may adopt a similar three-dimensional configuration in non-homologous proteins while differing in order and/or spacing along the sequence. Hence, functional annotation by means of sequence or fold similarity is not adequate in such cases.

Results

We describe a method for the function-related annotation of protein structures by means of the detection of local structural similarity with a library of annotated functional sites. An automatic procedure was used to annotate the function of local surface regions. Next, we employed a sequence-independent algorithm to compare exhaustively these functional patches with a larger collection of protein surface cavities. After tuning and validating the algorithm on a dataset of well annotated structures, we applied it to a list of protein structures that are classified as being of unknown function in the Protein Data Bank. By this strategy, we were able to provide functional clues to proteins that do not show any significant sequence or global structural similarity with proteins in the current databases.

Conclusion

This method is able to spot structural similarities associated with function-related similarities, independently of sequence or fold resemblance, and is therefore a valuable tool for the functional analysis of uncharacterized proteins. Results are available at http://cbm.bio.uniroma2.it/surface/structuralGenomics.html

Background

Detection of sequence or fold similarity is often used to infer the function of uncharacterized proteins. By this approach one can tentatively assign a function to approximately 45–80% of the proteins identified by the genomic projects [1, 2]. However, function is mostly determined by the physical, chemical and geometric properties of the protein surfaces [3, 4], and cases have been described where the same local spatial distribution of residues important for function is achieved with apparently unrelated structures and/or sequences [5]. One of the best-known examples is the SHD (Ser-His-Asp) catalytic triad of serine proteinases [6–8]. Furthermore, surface similarities have been detected in unrelated ATP/GTP binding proteins [9, 10] and in the guanine binding sites of p21Ras family GTPases or in the RNA binding site of bacterial ribonucleases [10]. By local structural comparison Hwang et al. [11] were able to infer correctly the nucleotide binding ability of an uncharacterized Methanococcus jannaschii protein.

On the other hand, similar folds can have different functions if their active sites have diverged [12–15]. As a consequence, methods relying purely on sequence and global structure comparison may lead to inaccurate function-related annotations in cases in which a few residues are responsible for the specificity of substrate interaction.

The vast majority of well-studied functions (enzymatic activities, binding abilities etc.) are encoded by a relatively small set of residues, often not contiguous in the protein sequence but organized in a conserved geometry on the protein surface that may be used as a marker for reliable functional annotation. Although exposed to the solvent, these function-related residues are often located in surface clefts or cavities [16]. Such residues define functional modules conserved in some proteins sharing a molecular function even if differing in sequence and structure. Several tools for discovering conserved three-dimensional patterns in protein structures have already been proposed [17–20]. Schmitt et al. [21] developed a clique-based method to detect functional relationships among proteins. This approach does not rely on detection of sequence or fold homology and highlights a number of non-obvious similarities among protein cavities. The algorithm, however, is computationally intensive and cannot be applied to an all-against-all analysis of protein surface regions. Binkowski and co-workers [22] recently described an approach for detecting sequence and spatial patterns of protein surfaces: the underlying algorithm is fast, but cannot identify similarities that are independent of the residue order in the compared proteins. Two related papers [23, 24] describe a method for local structural similarity detection (PINTS), which is of great relevance since it is able to evaluate the statistical significance of each match. PINTS has since been used to analyze protein structures from structural genomics projects [25]. Other recent papers present algorithms able to find structural motifs possibly related to a function and to use them to scan protein structure libraries [26–31].

In a previous work [32] we described the construction of a non-redundant library of annotated surface functional sites and a fast comparison algorithm able to find structural similarities independently of the residue sequence order. We report here the analysis of the results of the first all-versus-all comparison of the protein functional sites, the validation of the comparison procedure on a test dataset and its application to annotating a dataset composed of proteins solved in structural genomics projects. The results are available for experimental testing at http://cbm.bio.uniroma2.it/surface/structuralGenomics.html.

Results and discussion

Functional sites comparison

We used the compendium of protein surface regions associated with molecular functional sites stored in the SURFACE database [32]. This is a collection of 1521 annotated functional regions obtained following the procedure described in Figure 1 and in the Methods section. Each patch has at least one function-related annotation, which may be the ability to bind a certain ligand, or a match with a PROSITE or ELM pattern [33, 34]. Ligand-binding abilities are included among gene ontology (GO) molecular functions [35], as are many PROSITE patterns and ELM motifs. Some other PROSITE patterns correspond to short motifs that are conserved in all members of certain protein families, and which are not necessarily associated with known function-related residues. We chose to include this class of patterns in our annotation system, since they offer a quick way to verify the reliability of a match, and in many cases these motifs do contain functional residues. Hence, our annotations can be classified either as molecular functions or as protein signatures. It is worth noting that the annotation is extended to the whole patch but is also assigned to a subset of specific annotated functional residues.
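
The annotation scheme described above can be pictured as a small data structure. The following is only an illustrative sketch and not the SURFACE database schema: the class names, field names and residue numbers are assumptions introduced here, while the patch identifier 1nah_1 and the label LIG_NAD are taken from Table 2.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Annotation:
    """One function-related label attached to a surface patch (hypothetical model)."""
    source: str  # "ligand", "PROSITE" or "ELM"
    label: str   # e.g. "LIG_NAD" or "HEXOKINASES"
    kind: str    # "molecular function" or "protein signature"


@dataclass
class SurfacePatch:
    """A surface cleft plus the subset of residues that carry the annotation."""
    patch_id: str
    residues: frozenset
    annotated_residues: frozenset
    annotations: list = field(default_factory=list)

    def __post_init__(self):
        # The annotation covers the whole patch, but the annotated functional
        # residues must be a subset of the residues lining that patch.
        if not self.annotated_residues <= self.residues:
            raise ValueError("annotated residues must belong to the patch")

    def is_annotated(self):
        """True when the patch carries at least one annotation."""
        return len(self.annotations) > 0


# Residue numbers below are made up for illustration only.
nad_patch = SurfacePatch(
    patch_id="1nah_1",
    residues=frozenset({10, 11, 12, 31, 32, 58, 80, 124}),
    annotated_residues=frozenset({11, 12, 31, 58}),
    annotations=[Annotation("ligand", "LIG_NAD", "molecular function")],
)
print(nad_patch.is_annotated())
```

Keeping the annotated residues separate from the full residue set mirrors the distinction drawn in the text between the patch-level annotation and the specific functional residues.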
Figure 1

Description of the experimental procedure. Surface functional sites are automatically located and annotated as described in Methods. Surface clefts, identified by means of SURFNET, are filtered using a volume threshold, and annotated for the binding ability or for the presence of a functional motif from the PROSITE or ELM databases. This library (the SURFACE database) is used to scan a non-redundant collection of protein structures; a semi-automated procedure is used to define the conditions under which a structural similarity also implies a functional relationship. Finally, the SURFACE database is used to analyze a list of proteins with unknown function from structural genomics projects, obtaining in several cases significant similarities that could not have been spotted through sequence or fold similarity.

In [32] the structural matches obtained from the comparison of the SURFACE library against the entire collection of surface clefts (both annotated and not annotated) were evaluated by means of the Z-score of each match length against the distribution of the match lengths for any given annotated patch. Here we perform an exhaustive analysis in order to find the conditions under which a structural similarity also suggests a function-related similarity. First, only those matches that include annotated functional residues are considered, so that each structural similarity match is likely to hold a functional meaning. This step is crucial since many matches may be obtained because of general fold similarity, without an underlying functional relationship. Finding a functional match induces an annotation of at least some of the residues, and suggests reasonable hypotheses as to function (we are currently investigating how to use our approach to find novel function-related structural motifs, i.e. recurrent structural matches between proteins that cannot be explained by fold similarity alone and that may imply a previously undetected functional similarity).
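
As a rough illustration of the two steps just described, the sketch below scores match lengths as Z-scores within the distribution obtained for one query patch, and keeps only the matches that touch annotated functional residues. It is a minimal sketch under stated assumptions: the function names and all numbers are hypothetical, and the use of the population standard deviation is a choice made here, not a detail taken from [32].

```python
from statistics import mean, pstdev


def match_length_z_scores(match_lengths):
    """Z-score of each match length within one query patch's distribution.

    Assumption: the population standard deviation is used.
    """
    mu = mean(match_lengths)
    sigma = pstdev(match_lengths)
    if sigma == 0:
        # All matches have the same length: no match stands out.
        return [0.0] * len(match_lengths)
    return [(length - mu) / sigma for length in match_lengths]


def includes_annotated_residue(matched_query_residues, annotated_residues):
    """True when the match involves at least one annotated functional residue."""
    return not set(matched_query_residues).isdisjoint(annotated_residues)


# Hypothetical match lengths (aligned residues) for a single query patch.
lengths = [3, 3, 4, 3, 4, 5, 3, 4, 3, 10]
z_scores = match_length_z_scores(lengths)

# Hypothetical residue numbers: the first match misses the annotated residues.
annotated = {11, 12, 31, 58}
matched_sets = [{10, 80, 124}, {11, 31, 58, 80}]
functional = [m for m in matched_sets if includes_annotated_residue(m, annotated)]
```

In this toy distribution only the ten-residue match stands out, which is the behaviour a length-based Z-score is meant to capture.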

From the comparison of the SURFACE library against the entire collection of surface clefts, we collected a grand total of 65910 stringent matches among patch pairs, about 4.5% of which involve 6 or more residues and 4.5% involve 10 or more residues. A non-negligible number of these matches involve residue pairs whose relative distance is not conserved in the corresponding protein sequences. More interestingly, some of the matches involve residues whose sequence order and/or sequence spacing is different in the two proteins: some of these cases, which may be examples of convergent evolution, are currently under investigation. As an example, metals can interact with proteins by means of similar arrangements of residues that can be found across different folds [36–38]. Scanning our dataset with zinc-binding patches leads to the finding of significant matches to proteins belonging to 42 different folds and 6 different classes as defined by SCOP [39]. Different metal-binding patches lead to similar, though less dramatic, findings. Further analysis would suggest how many of these cases are associated with functional similarities as well.
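
Whether a match conserves sequence order or spacing can be checked directly from the matched residue pairs. The sketch below is only an illustration with hypothetical function names. The first example uses the textbook residue numbering of the Ser-His-Asp triad in chymotrypsin (His57, Asp102, Ser195) and subtilisin (Asp32, His64, Ser221), which is not data from this study; the other pairs are invented.

```python
def order_is_conserved(pairs):
    """pairs: (position in protein A, position in protein B) per matched residue.

    True when sorting by position in A also sorts the positions in B.
    """
    b_positions = [b for _, b in sorted(pairs)]
    return all(x < y for x, y in zip(b_positions, b_positions[1:]))


def spacing_is_conserved(pairs, tolerance=0):
    """True when consecutive matched residues are equally spaced in A and B."""
    ordered = sorted(pairs)
    for (a1, b1), (a2, b2) in zip(ordered, ordered[1:]):
        if abs((a2 - a1) - (b2 - b1)) > tolerance:
            return False
    return True


# Catalytic triad: His-His, Asp-Asp and Ser-Ser pairs, chymotrypsin vs subtilisin.
triad = [(57, 64), (102, 32), (195, 221)]
# Invented matches: same order and spacing, then same order but different spacing.
shifted = [(10, 110), (14, 114), (40, 140)]
stretched = [(10, 110), (14, 120), (40, 141)]

for name, pairs in [("triad", triad), ("shifted", shifted), ("stretched", stretched)]:
    print(name, order_is_conserved(pairs), spacing_is_conserved(pairs))
```

A match such as the triad, where the order differs between the two sequences, is exactly the kind that a sequence-order-dependent method would miss.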

The fraction of matches validated (as described in the Methods section) increases markedly with the Z-score (Table 1). At lower Z-scores, the GO terms and SWISS-PROT keywords validation methods are more represented, while, for more significant matches, the ability to bind the same ligands, fold similarity and co-presence of PROSITE motifs become more relevant.
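
The trend can be checked with a few lines of arithmetic on Table 1. The sketch below copies the "Total" and "Validated" counts for five of the Z-score rows and computes the validated fraction for each; the variable and function names are introduced here for illustration only.

```python
# (Z-score, total matches, validated matches): a subset of the rows of Table 1.
TABLE1_ROWS = [
    (3.0, 31341, 7066),
    (5.0, 1549, 764),
    (7.0, 365, 328),
    (10.0, 124, 113),
    (14.0, 29, 29),
]


def validated_fraction(total, validated):
    """Share of matches at one Z-score for which a validation criterion holds."""
    return validated / total


fractions = {z: validated_fraction(total, ok) for z, total, ok in TABLE1_ROWS}

for z, fraction in sorted(fractions.items()):
    print(f"Z-score {z:4.1f}: {fraction:.1%} of matches validated")
```

For these rows the validated share grows from under a quarter of the matches at Z-score 3.0 to all of them at Z-score 14.0.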

The matches that are characterized by a high Z-score but cannot be structurally or functionally justified by these methods are relatively few (Table 1). Of the 2173 matches having a Z-score higher than 7, 171 (7.9%) are not validated by the above-mentioned criteria. Of these 171 matches, 130 can be considered true positives, confirmed by the literature and by information derived from different sources and databases. The remaining 41 matches (1.9%) are not confirmed and should be tested experimentally. Thus, about 2% of the highly significant matches can be considered either possible false positive hits or new annotations. Some of these cases are shown and discussed in Figure 2(a,b).
Table 1

Structural matches Z-score distribution and validation. This table shows the number of structural matches (second column from the left) found as a function of the Z-score of the match. The third column from the left (labeled "Validated") reports the number of matches for which at least one of the validation criteria holds. The following columns show a breakdown of the number of matches validated by each validation condition (from the fourth column from the left to the rightmost: same PROSITE pattern annotation; same binding ability; common GO term annotation; same SCOP fold; same Enzyme Classification number; sequence similarity of at least 40%; common SwissProt keyword). Note that, for each row, the sum of the matches validated by the different criteria is higher than the total number of validated matches at that Z-score, since some matches can satisfy more than one condition. At increasing Z-scores, the share of validation conditions that we consider less reliable (SwissProt keywords, GO terms) decreases, while the share of more reliable annotations (i.e. same binding ability, same PROSITE pattern annotation) increases.

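| Z-score | Total | Validated | PROSITE | Ligand | GO | Scop | E.C. | Seq. sim. | SwissProt kw |
|---|---|---|---|---|---|---|---|---|---|
| 3.0 | 31341 | 7066 | 366 | 951 | 3565 | 765 | 99 | 2 | 5655 |
| 3.5 | 14948 | 4002 | 747 | 830 | 2222 | 889 | 48 | 3 | 2944 |
| 4.0 | 9721 | 2814 | 557 | 613 | 1680 | 788 | 44 | 1 | 2043 |
| 4.5 | 3942 | 1346 | 440 | 467 | 841 | 390 | 32 | 1 | 989 |
| 5.0 | 1549 | 764 | 281 | 234 | 436 | 411 | 5 | 1 | 514 |
| 5.5 | 976 | 612 | 287 | 181 | 320 | 399 | 7 | 0 | 342 |
| 6.0 | 639 | 457 | 177 | 209 | 267 | 271 | 3 | 0 | 323 |
| 6.5 | 621 | 548 | 279 | 258 | 298 | 447 | 4 | 0 | 383 |
| 7.0 | 365 | 328 | 157 | 115 | 180 | 246 | 2 | 0 | 200 |
| 7.5 | 260 | 219 | 105 | 68 | 109 | 176 | 6 | 1 | 152 |
| 8.0 | 270 | 238 | 104 | 87 | 149 | 191 | 0 | 1 | 169 |
| 8.5 | 209 | 195 | 80 | 57 | 129 | 153 | 8 | 1 | 131 |
| 9.0 | 122 | 107 | 54 | 54 | 70 | 87 | 0 | 0 | 63 |
| 9.5 | 137 | 129 | 60 | 48 | 74 | 119 | 0 | 0 | 80 |
| 10.0 | 124 | 113 | 53 | 61 | 75 | 104 | 0 | 1 | 86 |
| 10.5 | 55 | 51 | 17 | 22 | 29 | 43 | 2 | 0 | 36 |
| 11.0 | 106 | 103 | 46 | 40 | 65 | 91 | 4 | 0 | 66 |
| 11.5 | 88 | 88 | 42 | 43 | 65 | 80 | 5 | 0 | 55 |
| 12.0 | 78 | 77 | 33 | 34 | 51 | 75 | 5 | 0 | 52 |
| 12.5 | 71 | 69 | 26 | 32 | 38 | 64 | 5 | 1 | 54 |
| 13.0 | 49 | 47 | 30 | 21 | 24 | 45 | 0 | 0 | 30 |
| 13.5 | 39 | 39 | 9 | 19 | 17 | 39 | 1 | 0 | 24 |
| 14.0 | 29 | 29 | 14 | 16 | 18 | 29 | 3 | 0 | 25 |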

Figure 2

Significantly matching residues on proteins sharing no structural or sequence similarity. Similarity detected comparing the SURFACE database of annotated functional sites against a list of annotated monomers (a,b) or proteins with unknown function from structural genomics projects (c,d,e,f); the annotated patch residues are colored in blue, the matching residues in red; whenever possible, the patch annotation (bound ligand or PROSITE pattern) is shown. (a) Similarity detected between the E. coli UDP-galactose 4-epimerase (PDB code 1nah) NADH-binding patch and the H. influenzae YecO methyltransferase (1im8); the NAD co-crystallized with 1nah is shown; the similarity involves 7 residues (Z-score 9.06). (b) Structural similarity between the HEXOKINASES PROSITE pattern-annotated patch of the human hexokinase type I (1qha) and the bacteriophage ms2 capsid protein; additional 1qha annotated residues are shown in yellow. (c) Structural similarity detected between the B. subtilis Yqvk protein and the Wolinella succinogenes fumarate reductase cytochrome B subunit heme group binding patch. (d) Match between Hi1480 protein from Haemophilus influenzae and the bovine cytochrome Bc1 heme-binding patch. (e) Similarity between the B. subtilis protein Yqeu and the E. coli Grea transcript cleavage factor GREAB_1-annotated patch; additional pattern-annotated residues are shown in yellow. (f) Similarity between E. coli lysozyme inhibitor and two ATP-binding patches, the Rattus norvegicus 6-Phosphofructo-2-Kinase/Fructose-2,6-Bisphosphatase major patch (red) and the mouse Aaa ATPase P97 (green).

The result emerging from this validation procedure is that, using stringent parameters in the comparison step and using the Z-score as a threshold, our algorithm is reliable and able to spot local structural similarities related to functional relationships with only a few unconfirmed hits, which can be considered either false positives or testable hypotheses.

Estimating false negative matches (a false negative being the failure to detect a structural similarity between two proteins sharing the same function) is not straightforward, because the same or a similar molecular function may be achieved in different ways using a different three-dimensional residue arrangement. We estimated the occurrence of false negatives for PROSITE-annotated patches, using the list of known true positives (for which the function encoded by the pattern is experimentally verified) that PROSITE provides for each pattern. The procedure is as follows: for all the patches annotated with a given PROSITE pattern, we collect all matches obtained by scanning the entire patch dataset with these patches, selecting only those matches having a Z-score higher than a fixed threshold. The fraction of known true positives that are not found using the pattern-annotated patches as queries (i.e. the false negatives), when retrieving only those matches having a Z-score higher than 5, is 0.3 (meaning that we correctly retrieve 70% of the occurrences of PROSITE patterns in the dataset), and it rises to 0.35 when the Z-score threshold is set to 7.
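
The false negative estimate described above amounts to a set difference followed by a ratio. The sketch below shows that calculation on invented data: the function name, the protein identifiers and the Z-scores are all hypothetical and were chosen only so that the toy example behaves like the text (the missed fraction grows when the threshold is raised).

```python
def false_negative_fraction(known_true_positives, matches, z_threshold):
    """Fraction of known true positives not retrieved above a Z-score threshold.

    known_true_positives: set of protein identifiers listed for one pattern.
    matches: iterable of (target protein identifier, Z-score) pairs obtained by
             scanning the dataset with the pattern-annotated patches.
    """
    if not known_true_positives:
        raise ValueError("at least one known true positive is required")
    retrieved = {target for target, z in matches if z > z_threshold}
    missed = known_true_positives - retrieved
    return len(missed) / len(known_true_positives)


# Hypothetical pattern with ten known true positives (P1..P10).
known = {f"P{i}" for i in range(1, 11)}
# Hypothetical scan results; X1 is a hit that is not a known true positive.
scan = [("P1", 9.1), ("P2", 8.4), ("P3", 7.6), ("P4", 7.2), ("P5", 12.0),
        ("P6", 7.9), ("P7", 6.1), ("P8", 4.0), ("X1", 8.0)]

print(false_negative_fraction(known, scan, z_threshold=5))
print(false_negative_fraction(known, scan, z_threshold=7))
```

Hits outside the known true positive list, such as X1 here, do not enter this ratio; they are the unconfirmed hits that the validation procedure handles separately.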

Benchmark cases

To further test the ability of the procedure to find known cases of functional similarity among proteins for which sequence and/or structure similarity is not significant, a number of benchmark cases were investigated (Figure 3):
Figure 3

Benchmark cases analysis. (a) Structural superposition of the S. cerevisiae (red) and the E. coli (blue) chorismate mutase (PDB code 4csm and 1ecm, respectively). These two patches align ten residues, with a resulting Z-score of 15.76. (b) Structural superposition of the 4-hydroxyphenylpyruvate dioxygenase (PDB code 1cjx, red), the 2,3-dihydroxybiphenyl 1,2-dioxygenase (1han, blue), catechol 2,3-dioxygenase (1mpy, green) and the methylmalonyl-Coa epimerase (1jc5, yellow). The 1han co-crystallized iron ion is shown. (c) Superposition of the tumor necrosis factor-alpha-converting enzyme (1bkc, red) and the peptide deformylase (1icj, blue). The 1icj co-crystallized nickel ion is shown. (d) Structural superposition of the human P21 ras protein (5p21, red) and HprK/P 1jb1 (blue). (e) Structural superposition of the 1b37 FAD-binding pocket (red) with the highest-score matches obtained in a database search (blue). The 1b37-bound FAD is shown. (f) Bound ligands superposition. When the three-dimensional transformation that superposes the residues aligned in (e) is applied, the ligands bound to some of these proteins are superposed as well. The ADP molecule bound to the 1djn patch nicely matches the ADP moiety in the similar FAD-binding pockets.

  i) The S. cerevisiae and the E. coli chorismate mutases (PDB codes: 4csm and 1ecm, respectively), despite the very low sequence identity, share a similar fold and a similar main functional site [18, 21]. The largest patch of 1ecm is annotated for the oxy-bridged prephenic acid binding ability. Using this patch as a query, the highest Z-score match is found with the largest patch of 4csm (Figure 3a).

  ii) The Glyoxalase/Bleomycin resistance protein/Dihydroxybiphenyl dioxygenase fold is common to several unrelated metal ion binding proteins sharing similar catalytic mechanisms, including the bleomycin resistance protein, glyoxalase I, and a family of extradiol dioxygenases [40]. We detected a significant similarity among P. fluorescens 4-hydroxyphenylpyruvate dioxygenase (PDB code 1cjx), B. cepacia 2,3-dihydroxybiphenyl 1,2-dioxygenase (1han), P. putida catechol 2,3-dioxygenase (1mpy) and P. shermanii methylmalonyl-Coa epimerase (1jc5). The comparison algorithm correctly identifies the residues involved in Fe binding (Figure 3b). The second largest patch of 1han is annotated for the iron binding ability. Structural matches with the 1mpy, 1cjx and 1jc5 functional sites are found at high Z-score (7.19).

  iii) Metal ions can be coordinated by histidine clusters. We identified a similarity between the human tumor necrosis factor-alpha-converting enzyme (PDB code: 1bkc) Zn binding site and the E. coli peptide deformylase (PDB code: 1icj) Ni binding site, despite their sequence and fold diversity (Figure 3c). The zinc-binding patch of 1bkc shares eight residues in the same structural conformation with the nickel-binding patch of 1icj, with a Z-score of 10.66.

  iv) Nucleotide binding abilities can be associated with several unrelated proteins; we detected a high-scoring match between the GTP-binding annotated patch of the human p21 ras protein (5p21) and the L. casei Hpr kinase (1jb1) that aligns eight residues with a Z-score of 9.01 (Figure 3d). These two proteins do not share any significant sequence or fold similarities.

As a further test, we analyzed the flavin-adenine dinucleotide (FAD) binding pockets, known to share structural similarities with other adenine-containing nucleotide binding pockets, despite sequence and fold differences [41, 42]. FAD consists of an adenosine monophosphate (AMP) linked to a flavin mononucleotide (FMN) through a pyrophosphate bond and is involved as a cofactor in many biological processes. Using the FAD-binding patch of the Zea mays polyamine oxidase (1b37) as bait, we selected 9 prey patches with a Z-score higher than 12: 8 preys are annotated as being able to bind a FAD molecule and belong to the same SCOP fold (FAD/NAD(P)-binding domain). The remaining trapped patch is the largest patch of the trimethylamine dehydrogenase from Methylophilus methylotrophus (1djn), an iron-sulfur flavoprotein, and it is annotated as ADP-binding. 1djn is also co-crystallized with an FMN, which is very similar to FAD, but this ligand is associated with the second largest patch of the 1djn structure. The residues associated by the alignment program are shown in Figure 3e. These proteins share a very low sequence similarity, which cannot be revealed using BLAST2 [43]. The ADP binding patch of the 1djn structure is nicely superposed on the other patches in the binding pocket (Figure 3e), but shares no evident fold similarity with them, and belongs to a different SCOP fold (the nucleotide-binding domain). When the selected structures in Figure 3f are physically superposed (by finding the least-squares fit of the matching residues), the ligands bound to these structures also turn out to be nicely superposed. The procedure could therefore highlight the ability to bind a substructure of the FAD molecule, namely an ADP molecule in the 1djn major patch, even with very low levels of sequence and structure similarity.

Using each FAD binding patch to scan the dataset, we selected only proteins whose known functional properties are consistent with the FAD or nucleotide binding ability.

Structural genomics proteins analysis

With the stringent parameters described above, we were able to detect only matches linked to function-related similarities, even in cases of non-homologous proteins. For that reason, once proved to be reliable, the procedure can be applied as a predictive tool to obtain clues concerning the function(s) of uncharacterized proteins.

We selected 257 protein structures from the PDB, corresponding to 513 chains, that are marked as being of unknown function, as hypothetical proteins, or as having been solved within a structural genomics project. We analyzed these structures by looking for reliable similarities to our functional sites library and were able to suggest one or more molecular functions for 191 of these chains, for a total of 534 similarity matches. For each match, we checked whether the previously described criteria hold (i.e. common GO term, SwissProt keyword, EC number or SURFACE annotation). If not, a literature search was carried out to verify the functional relationship. By means of this analysis of the likelihood of each single match, we found that 322 (60.3%) of these hits are validated by experimental analyses that have already characterized many of these proteins, while only 29 matches (5.4%) are not confirmed by previous findings; 107 hits (20%) involve proteins whose functions are still unknown; 76 hits (14.2%) involve proteins to which a hypothetical function has been assigned by means of global sequence or structure similarity. In this latter case, the function-related annotation obtained from our method can be considered a new functional annotation that corrects or improves the current function assignment. Hence, we were able to propose a function by similarity using the annotated patch database 184 times, to 127 different chains (matches with Z-score at least 7 are shown in Table 2). 56% of these new functional annotations concern a PROSITE pattern, the remaining 44% a ligand binding ability; this is somewhat surprising, since the majority of the patch annotations in the SURFACE library concern binding abilities. A selection of the proposed functional regions is shown in Figure 2(c,d,e,f), while the complete list can be found at http://cbm.bio.uniroma2.it/surface/structuralGenomics.html.
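
The breakdown of the 534 matches can be reproduced with a short calculation. In the sketch below the four counts are those reported in the text, while the category labels and variable names are paraphrases introduced here.

```python
# Counts reported in the text for the 534 similarity matches.
MATCH_OUTCOMES = {
    "validated by previous experimental work": 322,
    "not confirmed by previous findings": 29,
    "target protein of unknown function": 107,
    "target function only inferred from global similarity": 76,
}

total_matches = sum(MATCH_OUTCOMES.values())
percentages = {
    outcome: round(100 * count / total_matches, 1)
    for outcome, count in MATCH_OUTCOMES.items()
}

for outcome, share in percentages.items():
    print(f"{outcome}: {share}%")
```

The four categories add up to the 534 matches, and the rounded shares agree with the percentages quoted above.
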

For each match we tested the BLAST2 pair-wise sequence similarity between the sequence of the protein to which the query patch belongs and the target protein sequence, the PsiBLAST sequence similarity matches obtained by running the target sequence against the non-redundant SwissProt+TrEMBL sequence database, the global structural similarities of the target structure in the PDB using SSM, and the local similarity using PINTS [24]. The match with the highest Z-score (14.29) is between the B. subtilis Yqvk protein (PDB code 1rty) and the Wolinella succinogenes fumarate reductase cytochrome B subunit major patch (1qlaC1), annotated with the heme group binding ability; the structural similarity involves 7 residues. The two proteins do not share any sequence or structural similarity, as checked using BLAST and the structural comparison algorithm SSM [44]. A PsiBLAST run of the Yqvk sequence against the non-redundant SwissProt+TrEMBL shows a significant similarity (E-value 4e-19) with the mouse cobalamin adenosyltransferase (SwissProt entry name MMAB_MOUSE), while the SSM comparison against the whole PDB leads to only one significant similarity, with another uncharacterized protein, the conserved protein 0546 from Thermoplasma acidophilum (1nog). A PINTS comparison [24] of Yqvk against pre-compiled libraries of structural patterns retrieves as most significant matches one with the human Small Nuclear Ribonucleoprotein Sm D3 (PDB code 1d3b), aligning 3 pairs of residues with r.m.s.d. 0.32 and E-value 0.00481, and another with the pig Dihydropyrimidine Dehydrogenase 1htx (3 pairs aligned with r.m.s.d. 0.337 and E-value 0.00839). The heme binding ability may thus be a new functional annotation of this poorly known protein.

The second highest Z-score match (13.32, 9 residues structurally aligned) occurs between Hi1480 protein from Haemophilus influenzae (1mw5) and the bovine cytochrome Bc1 heme-binding patch (1bgyC2). No significant sequence similarity is found in SwissProt+TrEMBL (the highest match, whose E-value is 2.1, involves the putative E. coli RNA helicase, SwissProt entry name RHLE_ECOLI), and no significant matches are found using SSM. PINTS matches involving three residues are found with the influenza virus Bha/Lsta protein (1mqm) and the Candida tropicalis Enoyl Thioester Reductase 2 (1h0k), whose E-values are 0.401 and 0.451, respectively.

Another high-score match (Z-score 10.05, length 7 residues) is found between the B. subtilis protein Yqeu (1vhk) and the E. coli Grea transcript cleavage factor major patch (1grj_1), which is annotated with the GREAB_1 PROSITE pattern, a signature of this class of cleavage factors. Yqeu shares SSM-detected structural similarities with another protein of unknown function (namely H. influenzae 1nxz) and significant sequence similarity with a list of hypothetical and uncharacterized bacterial proteins. PINTS reports a local structural similarity with the zinc-binding site of the E. coli CTP-ligated T state aspartate transcarbamoylase (E-value 0.00894, r.m.s.d. 0.544 over three pairs of residues).
Table 2

Non-validated functional annotations of non-annotated surface patches. Functional annotated sites have been compared to a collection of surface patches extracted from a non-redundant PDB subset. The reliability of each match was estimated via a series of criteria, as described in the text. The remaining similarities may be new functional annotations of uncharacterized functional sites, or false positive matches, and are shown in this table. Columns:(i) PDB code, chain name and patch number in the annotated query patch; (ii) Description of the protein to which the query patch belongs; (iii) Query patch functional annotation; (iv) Target patch; (v) Description of the protein to which the target patch belongs; (vi) Z-score of the match; (vii) SSM Q score; (viii) SSM P score; (ix) SSM Z score. The SSM Q score takes into account the number of aligned residues, their r.m.s.d. and the size of the proteins; a high Q score means a good similarity. The SSM P score is the log of the pValue (the probability that the match occurred by chance); P scores higher than 3 are considered significant by the authors of the method.

Patch 1

Protein

Patch 1 Annotation

Patch 2

Protein

Z-score

SSM Qscore

SSM P-value

SSM Z-score

3mdeA1

Acyl-CoA dehydrogenase

LIG_CO8

1g5bB6

Bacteriophage lambda S/T Protein Phosphatase

9.59

0.01

0

0.5

1qhaA2

Hexokinase I

HEXOKINASES

1i78A5

Outer Membrane Protease Ompt

9.44

0.01

0

0.5

1qhaA2

Hexokinase I

HEXOKINASES

1zdhA2

Bacteriophage Ms2 Protein Capsid

9.44

0.01

0

0.1

1bp1_1

Bactericidal permeability-increasing protein

LIG__PC

1qlwA2

Bacterial esterase 713

9.07

0.01

0

1.5

1nah_1

UDP-galactose 4-epimerase

LIG_NAD

1im8A1

YecO methyltransferase

9.06

0.1

0

4

4blcA1

Beef liver catalase

LIG_NDP

1io1A5

Phase 1 Flagellin

8.86

0.01

0

1.4

1dbtA1

Orotidine 5'-Monophosphate Decarboxylase

OMPDECASE

1dj8A1

E. Coli Periplasmic Protein Hdea

8.76

0.03

0

1.9

1fp2A1

Isoflavone O-Methyltransferase

LIG_SAH

1nah_1

UDP-galactose 4-epimerase

8.6

0.05

0

5.5

1fps_1

Prenyltransferase Trimethylamine

POLYPRENYL_SY NTHET_1

1h6gA2

Alpha-catenin Molybdopterin Biosynthesis Moeb

8.54

0.04

0

0.3

1djnA1

dehydrogenase

LIG_ADP

1jwbB1

Protein

8.51

0.05

0

5.3

19hcA1

Cytochrome C

LIG_HEM

1umuB1

UmuD' protein

8.44

0.03

0

4.2

1qhaA1

Type I Hexokinase

HEXOKINASES

1e2uA1

Hybrid Cluster Protein

8.34

0.01

0

0.1

256bA1

Cytochrome B562

LIG_HEM

1gpjA1

Glutamyl-tRNA reductase

8.25

0.05

0

0.4

1ep1B1

Dihydroorotate Dehydrogenase B

LIG_FAD

1pmi_8

Phosphomannose Isomerase

8.18

0.02

0

0.3

1tsdA1

Thymidylate synthase

LIG_U18

1prhA1

Prostaglandin H2 Synthase-1 Formylmethanofuran: Tetrahydromethanopterin

8.16

0.01

0

0.1

2nlrA1

Endoglucanase

LIG_G2F

1ftrA1

Formyltransferase

8.05

0.02

0

0.5

1ej0A1

RNA Methyltransferase

LIG_SAM

2cmd_1

Malate Dehydrogenase

8.01

0.12

0

3.9

1ecmB1

Chorismate mutase

LIG_TSA

1b3qB1

Histidine Kinase Chea

7.96

0.02

0

2.8

1av6A3

Vaccinia Methyltransferase Vp39

LIG_SAH

1b3mA1

Sarcosine oxidase

7.95

0.02

0

2.8

1av6A3

Vaccinia Methyltransferase Vp39

LIG_SAH

1b4vA1

Cholesterole oxidase

7.95

0.02

0

0.9

1qrrA1

Sulfolipid Biosynthesis (Sqd1) Protein

LIG_NAD

1g6q12

Arginine methyltransferase HMT1

7.85

0.04

0

1.9

1qrrA1

Sulfolipid Biosynthesis (Sqd1) Protein

LIG_NAD

1im8A1

YecO methyltransferase

7.85

0.09

0

2.4

1qrrA1

Sulfolipid Biosynthesis (Sqd1) Protein

LIG_NAD

1khhA1

Guanidinoacetate methyltransferase

7.85

0.1

0

2.9

6reqA1

Methylmalonyl-Coa Mutase

LIG_3CP

1fepA2

Ferric Enterobactin Receptor

7.79

0.01

0

0

6reqA1

Methylmalonyl-Coa Mutase

LIG_3CP

1jihB10

Yeast DNA Polymerase Eta

7.79

0.01

0

1

1bgyC1

Cytochrome BC1

LIG_HEM

1dc1B2

Bsobi Restriction Endonuclease

7.62

0.01

0

0.4

1bgyC2

Cytochrome BC1

LIG_HEM

1k92A4

Argininosuccinate Synthetase

7.62

0.01

0

0.2

1bgyC2

Cytochrome BC1

LIG_HEM

5r1rA2

Ribonucleotide Reductase R1

7.62

0.01

0

0.9

1qanA1

Rrna Methyltransferase Ermc'

RRNA_A_DIMETH

1b37B1

Flavin-dependent polyamine oxidase

7.54

0.04

0

5.3

1qanA1

Rrna Methyltransferase Ermc'

RRNA_A_DIMETH

1b3mA1

Sarcosine oxidase

7.54

0.04

0

4.3

1qanA1

Rrna Methyltransferase Ermc'

RRNA_A_DIMETH

1gpeA1

Glucose oxidase

7.54

0.03

0

3.2

1qanA1

Rrna Methyltransferase Ermc'

RRNA_A_DIMETH

1i8tA1

UDP-galactopyranose mutase

7.54

0.04

0

4.1

2cut_1

Serine esterase

LIG_DEP

1jfrA1

Exfoliatus Lipase

7.43

0.17

0

5.3

1bp1_2

Bactericidal Permeability-increasing protein

LIG__PC

1fuoA10

Fumarase C

7.42

0.01

0

0.1

1hcy_4

Hexameric haemocyanin

LIG_NAG

2kinA2

Kinesin

7.42

0.01

0

2.2

1cpq_1

Cytochrome C

LIG_HEM

1wpoB1

Human Cytomegalovirus Protease

7.41

0.01

0

1.3

1inp_1

Inositol polyphosphate 1-phosphatase

IMP_2

1bgxT6

TAQ polymerase

7.38

0

0

0

1ksaA1

Bacteriochlorophyll A Protein

LIG_BCL

1xvaA1

Glycine N-Methyltransferase

7.27

0.02

0

1.3

1b63A1

MutL DNA mismatch repair protein

LIG_ANP

1wpoB1

Human Cytomegalovirus Protease

7.22

0.03

0

0.6

1e7uA1

Phosphoinositide 3-Kinase Inhibition

PI3_4_KINASE_1

1qi9B1

Vanadium Bromoperoxidase

7.15

0.01

0

0.6

1a12A1

Regulator Of Chromosome Condensation (Rcc1)

RCC1_2

1cruB1

Soluble Quinoprotein Glucose Dehydrogenase

7.06

0.08

0

0.4

In some cases we found a structural similarity between a protein of unknown function and two patches annotated with the same function, which strengthens the hypothesis of a function-related similarity. The conserved hypothetical protein Tm0667 from Thermotoga maritima (PDB code 1j6o) shows a structural similarity with surface patches of E. coli nucleotidyltransferase (1gupA2) and Desulfovibrio gigas rubredoxin:oxygen oxidoreductase (1e5dA4), both annotated as iron binding. The E. coli lysozyme inhibitor (1gpq), whose function is still uncharacterized, may bind ATP, given its similarity to the major patch of Rattus norvegicus 6-phosphofructo-2-kinase/fructose-2,6-bisphosphatase (1bif_1) and to the second patch of the mouse AAA ATPase p97 (1e32A2).

For each described match we propose that the detected structural similarity reveals a function-related similarity. For each match we also checked whether the similarity could have been detected by sequence comparison, using BLAST and PSI-BLAST, or by structural comparison, using SSM and PINTS. Our approach, which is based on the comparison of local functional surface residues independently of their sequence order, may overcome limitations of current methods that are possibly due to our incomplete knowledge of the sequence/structure/function relationship or to convergent evolution. Even with PINTS, a tool similar in philosophy to our approach, the findings are different, suggesting that different tools may be complementary in the difficult task of protein functional annotation. On the other hand, this may also highlight the difficulty of evaluating the significance of local similarities, which in many cases are restricted to a very small number of residues.

Conclusion

The expected burst in the number of protein structures that are not associated with a biological function, stimulated by the structural genomics programs, has emphasized the need for tools that reveal structural regularities even in proteins that do not share sequence or fold similarity [1, 45]. Protein structures selected in structural genomics projects usually share very little sequence similarity with the dataset of already characterized proteins [46]. Sequence analysis tools are therefore unsuitable for inferring their functions. Moreover, cases are known where active site residues are not conserved in proteins sharing a common structural fold; therefore, "traditional" structure comparison tools are also not always able to help in function-related annotation.

Using a fully automated procedure, we obtained a reliable library of annotated protein functional sites. A fast structural comparison algorithm allows the rapid scanning of one or more protein structures against the library in search of local structural similarities. This method is designed to help functional annotation in difficult cases. Our method for determining and comparing annotated surface patches offers a new and powerful resource for detecting related functions among unrelated proteins, both for proteins solved in structural genomics projects and for identifying new function-related sites on the surface of already characterized proteins. We were able to provide one or more functional clues for a large set of novel proteins and, where functional evidence is already available, our findings confirm it. Moreover, just as proteins with different sequences and folds can share a similar functional site, proteins with similar sequence and/or fold can have small local differences leading to a completely different function [1, 21]. Our method, which focuses on a detailed analysis of functional sites, is able to predict protein functions in these difficult cases. It can therefore be used to analyze the complex evolutionary relationships among protein sequence, structure and function [47–49]. The complete list of the functional predictions that we obtained is accessible at http://cbm.bio.uniroma2.it/surface/structuralGenomics.html; the structurally similar residues are shown for each match, and the structural superposition can be viewed through the browser plug-in Chime or with RasMol. A novel publicly available web server, pdbFun [50], has been developed to allow the on-line structural comparison of user-defined subsets of residues of protein chains; pre-defined subsets, such as the SURFACE library of annotated functional sites, will also be provided.

Methods

Functional site library extraction and annotation

The SURFACE database [32] stores a library of 1521 annotated function-related surface regions obtained using the following procedure (described in Figure 1). First, the SURFNET algorithm [51] is applied to a non-redundant, representative list of around 2000 protein chains from the PDB database [52] (downloadable at http://www.ncbi.nlm.nih.gov/Structure/VAST/nrpdb.html) in order to find all the surface clefts with a volume higher than an arbitrary threshold (200 Å³). Then, for each cleft, a surface patch is identified as a collection of solvent-exposed residues using the MASK algorithm (which is part of the SURFNET package). Finally, we infer the function of such surface patches using two kinds of annotation: ability to bind (associated with surface patch residues that contact a bound ligand), and match with PROSITE or ELM [33, 34] functional motifs. The ability-to-bind annotation is carried out by selecting those residues within 3.5 Å of any atom of a ligand found in the crystal structure. Whenever a single patch contains more than 75% of the ligand-contacting residues (62% of the cases), we assign the ligand binding ability to this surface cleft. Considering only large organic molecules and metal ions, the fraction of ligands that can be unequivocally associated with a single patch rises to 78%. PROSITE annotations are obtained by scanning the sequences of the monomers in our dataset with the ScanProsite algorithm [53], which found 928 matches; 12 matches were found with the experimentally verified ELM [34] instances. We did not consider those patterns marked by PROSITE as unspecific. Moreover, we annotated only those residues that correspond to non-X positions in the regular expression and that are exposed to the solvent according to the NACCESS procedure [54, 55]. Once the dataset chains have been annotated, we map the annotated residues onto the structure and the surface patches. Whenever a single patch contains more than 75% of the solvent-exposed pattern residues, we assign the function encoded by this pattern to the patch (43% of the cases).
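
The ligand-based annotation rule can be illustrated with a minimal Python sketch. It is not the code used to build SURFACE: the function names, the residue identifiers and the input layout (plain coordinate tuples instead of parsed PDB records) are assumptions made for illustration. Only the 3.5 Å contact distance and the "more than 75%" rule come from the text.

```python
import math

CONTACT_CUTOFF = 3.5   # Å, contact distance quoted in the text
PATCH_FRACTION = 0.75  # a patch must hold MORE than this fraction of contacts

def contacting_residues(protein_atoms, ligand_atoms, cutoff=CONTACT_CUTOFF):
    """Return the ids of residues with at least one atom within `cutoff` of a ligand atom.

    protein_atoms: iterable of (residue_id, (x, y, z)) pairs (hypothetical layout)
    ligand_atoms:  iterable of (x, y, z) tuples
    """
    ligand_atoms = list(ligand_atoms)
    contacts = set()
    for residue_id, coord in protein_atoms:
        if residue_id in contacts:
            continue  # residue already known to touch the ligand
        if any(math.dist(coord, lig) <= cutoff for lig in ligand_atoms):
            contacts.add(residue_id)
    return contacts

def assign_ligand_to_patch(contacts, patches, min_fraction=PATCH_FRACTION):
    """Return the id of the patch holding more than `min_fraction` of the
    ligand-contacting residues, or None if no single patch qualifies."""
    if not contacts:
        return None
    for patch_id, residues in patches.items():
        shared = len(contacts & set(residues))
        if shared / len(contacts) > min_fraction:
            return patch_id
    return None
```

With disjoint patches at most one patch can exceed the 75% threshold, so returning the first qualifying patch is unambiguous in that case; how overlapping clefts are handled in the original procedure is not stated in the text.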

Structural comparison

A sequence/fold-independent algorithm was used for local surface comparison [32]. The algorithm starts from a seed match (a pair of residues in the query that can be found in the target at the same distance and with similar physical and chemical characteristics). The structural superposition, obtained by the quaternion method [56] and assessed at each step by residue similarity and root mean square deviation (r.m.s.d.) of the matching residues, is extended by adding neighboring residues to the seed match for as long as r.m.s.d. and residue similarity remain within user-defined thresholds (we used a similarity of at least 0.3 for each added pair of residues and an average similarity of at least 1.2, using the Dayhoff substitution matrix [57], and 0.8 Å as maximum r.m.s.d.). We consider only structural matches that include at least a fixed fraction (50%) of the functionally annotated residues, to increase the likelihood that the structural match is a function-related match as well. The algorithm is very fast and explores all the combinations of similar/identical residues in a sequence-independent way. The score of a match is the number of residues that can be superposed within the defined similarity thresholds. The significance of the score is evaluated by calculating the Z-score over the score distribution obtained by comparing the query patch with the whole dataset: for each match, the Z-score is computed as the difference between the score of the match and the average score of all the matches for the query patch, divided by the standard deviation.
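
The Z-score definition given above can be written as a short Python sketch. The function name and the dictionary input are illustrative assumptions, and the text does not say whether the population or the sample standard deviation is used; the population form is assumed here.

```python
from statistics import mean, pstdev

def match_z_scores(scores):
    """Z-score of each match of one query patch against the dataset.

    scores: dict mapping a target patch id to the match score, i.e. the
            number of residues superposed within the similarity thresholds.
    Assumption: population standard deviation (pstdev); the text only says
    "the standard deviation".
    """
    values = list(scores.values())
    average = mean(values)
    deviation = pstdev(values)
    if deviation == 0:
        # All matches have the same score, so none stands out.
        return {target: 0.0 for target in scores}
    return {target: (score - average) / deviation
            for target, score in scores.items()}
```

A match is then ranked by how far its score lies above the average score obtained by the same query patch, which makes scores from query patches of different sizes comparable.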

In order to estimate the number of true positive matches, defining a true positive match as a structural similarity that also implies a functional similarity, we checked whether the two matching proteins also share: (i) a common Gene Ontology (GO) term; (ii) a common SwissProt keyword; (iii) the same Enzyme Classification (EC) number; (iv) the same functional annotation (i.e. the binding of the same ligand or a match with the same PROSITE or ELM pattern). The Gene Ontology term search is limited to molecular function or biological process annotations linked to PDB structures by the GOA project [35]. SwissProt [58] keywords were extracted from the SwissProt entries corresponding to the DBREF field in the header of the PDB [52] files. If this was not available, we extracted the sequence from the order of residues in the structure and then looked for a close homolog (sequence similarity higher than 95% using BLAST) in the SwissProt database. Some keywords were excluded because they do not refer to protein function (e.g. Structural protein, Polymorphism, Alternative promoter usage). Furthermore, we checked whether the two matching proteins share more than 40% sequence similarity or the same fold, using the SCOP structural classification [39] at the superfamily level. Our database is composed of patches extracted from a non-redundant list of structures, therefore these cases are infrequent.
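
The four checks listed above can be sketched in Python as follows. This is a hedged illustration, not the authors' implementation: the `ProteinAnnotation` container and the function names are hypothetical, the excluded-keyword set contains only the three examples named in the text (the full list is not given), and treating a match as a true positive when any one criterion holds is an assumption.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional

# Only the three keywords named in the text; the complete exclusion list is not given.
EXCLUDED_KEYWORDS = frozenset(
    {"Structural protein", "Polymorphism", "Alternative promoter usage"}
)

@dataclass(frozen=True)
class ProteinAnnotation:
    """Hypothetical container for the annotations of one matching protein."""
    go_terms: FrozenSet[str] = frozenset()
    swissprot_keywords: FrozenSet[str] = frozenset()
    ec_number: Optional[str] = None
    site_annotations: FrozenSet[str] = frozenset()  # e.g. "LIG_HEM", PROSITE/ELM ids

def shared_evidence(a: ProteinAnnotation, b: ProteinAnnotation) -> List[str]:
    """Return the names of criteria (i)-(iv) satisfied by the pair."""
    evidence = []
    if a.go_terms & b.go_terms:
        evidence.append("GO term")
    if (a.swissprot_keywords & b.swissprot_keywords) - EXCLUDED_KEYWORDS:
        evidence.append("SwissProt keyword")
    if a.ec_number is not None and a.ec_number == b.ec_number:
        evidence.append("EC number")
    if a.site_annotations & b.site_annotations:
        evidence.append("functional annotation")
    return evidence

def is_true_positive(a: ProteinAnnotation, b: ProteinAnnotation) -> bool:
    """Assumption: one shared criterion is enough to call the match a true positive."""
    return bool(shared_evidence(a, b))
```

Returning the list of satisfied criteria, rather than a bare boolean, keeps the evidence for each match inspectable.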
Table 3

Function prediction for uncharacterized proteins. Functionally annotated sites were used to infer the function(s) of a large set of uncharacterized proteins, using similarity threshold values that had been successfully tested on a training dataset. Columns: (i) PDB code and chain name of the structural genomics protein; (ii) PDB code, chain name and surface patch serial number of the functionally annotated patch; (iii) functional annotation of the matching patch; (iv) Z-score of the match; (v) number of aligned residues; (vi) Blast2 bitscore; (vii) sequence similarity evaluated by means of the Needleman-Wunsch global alignment (using the needle application of the EMBOSS package [59]); (viii) SSM Q score; (ix) SSM P score; (x) SSM Z score.

| Str.gen | SURFACE patch | Annotation | Z-score | Score | BLAST2 | Seq Sim | SSM Q | SSM P | SSM Z |
|---|---|---|---|---|---|---|---|---|---|
| 1rtyC0 | 1qlaC1 | LIG_HEM | 14 | 7 | 0 | 0.5 | 0.06 | 0 | 1.5 |
| 1mw5A0 | 1bgyC2 | LIG_HEM | 13 | 9 | 0 | 0.8 | 0.04 | 0 | 1.8 |
| 1vhqA0 | 1ct9A1 | LIG_AMP | 13 | 8 | 13.9 | 1.2 | 0.02 | 0 | 1.8 |
| 1vhsB0 | 1cjwA1 | LIG_COT | 13 | 9 | 13.1 | 0.6 | 0.45 | 3.2 | 5.6 |
| 1vhsA0 | 1qsmD1 | LIG_ACO | 12 | 9 | 11.9 | 35.3 | 0.41 | 2.3 | 4.7 |
| 1j2rC0 | 19hcA1 | LIG_HEM | 12 | 8 | 12.7 | 0.4 | 0.01 | 0 | 1.5 |
| 1oz9A0 | 1fy7A1 | LIG_COA | 11 | 7 | 14.6 | 1.5 | 0.04 | 0 | 0.4 |
| 1vimA0 | 1dqrA1 | LIG_6PG | 10 | 7 | 15.4 | 1.1 | 0.07 | 0 | 3.1 |
| 1vj1A0 | 1tsdA1 | LIG_UMP | 10 | 6 | 13.9 | 3.2 | 0.01 | 0 | 0.2 |
| 1rtyA0 | 1fps_1 | POLYPRENYL_SYNTHET_2 | 10 | 7 | 16.2 | 15 | 0.05 | 0 | 3.6 |
| 1vhnA0 | 2dorA1 | DHODEHASE_2 | 10 | 8 | 13.5 | 0.7 | 0.23 | 0 | 5.7 |
| 1vhkA0 | 1grj_1 | GREAB_1 | 10 | 7 | 0 | 2.5 | 0.02 | 0 | 2.2 |
| 1k7kA0 | 1qd1B1 | LIG_FON | 10 | 6 | 12.7 | 1 | 0.03 | 0 | 0.5 |
| 1vhkC0 | 1qd1B1 | LIG_FON | 10 | 6 | 13.5 | 2.1 | 0.04 | 0 | 2.5 |
| 1vhcA0 | 1bmtA2 | LIG_COB | 10 | 8 | 15 | 3.7 | 0.06 | 0 | 2 |
| 1uf9A0 | 1esmA1 | LIG_COA | 10 | 8 | 13.9 | 0.4 | 0.11 | 0 | 4.2 |
| 1h2hA0 | 1ezfA1 | SQUALEN_PHYTOEN_SYN_1 | 10 | 7 | 13.1 | 1.3 | 0.02 | 0 | 0.4 |
| 1j5pA0 | 1ezfA1 | SQUALEN_PHYTOEN_SYN_1 | 10 | 7 | 13.1 | 1.5 | 0.02 | 0 | 1 |
| 1rcuB0 | 2tpsB1 | LIG_TPS | 10 | 7 | 13.9 | 4.2 | 0.08 | 0 | 1.8 |
| 1vhcA0 | 2tpsB1 | LIG_TPS | 10 | 7 | 16.2 | 7.9 | 0.32 | 0.1 | 4.2 |
| 1jriC0 | 1atiA1 | AA_TRNA_LIGASE_II_1 | 10 | 6 | 14.2 | 6.6 | 0.02 | 0 | 2.2 |
| 1j9jA0 | 1ft1A6 | PPTA | 10 | 7 | 14.2 | 1.9 | 0.02 | 0 | 0.6 |
| 1j9kB0 | 1ft1A6 | PPTA | 10 | 7 | 14.2 | 1.9 | 0.01 | 0 | 0.7 |
| 1i36A0 | 1eluA5 | LIG_PDA | 9 | 6 | 13.9 | 0.5 | 0.04 | 0 | 1.2 |
| 1j6pA0 | 1bxoA1 | ASP_PROTEASE | 9 | 6 | 13.9 | 2.7 | 0.02 | 0 | 0.3 |
| 1p5fA0 | 1eyrA1 | LIG_CDP | 9 | 6 | 21.6 | 33.2 | 0.06 | 0 | 1.9 |
| 1kytA0 | 1drmA1 | LIG_HEM | 9 | 6 | 12.3 | 0.9 | 0.02 | 0 | 1.6 |
| 1l6rB0 | 1drmA1 | LIG_HEM | 9 | 6 | 0 | 0.9 | 0.02 | 0 | 0.8 |
| 1j6rA0 | 1pprM1 | LIG_DGD | 9 | 6 | 0 | 3.7 | 0.01 | 0 | 1.4 |
| 1p99A0 | 1dik_1 | LIG_SO4 | 9 | 6 | 14.2 | 1.7 | 0.07 | 0 | 0.7 |
| 1j2rD0 | 1dbtA1 | OMPDECASE | 9 | 6 | 15 | 2.5 | 0.07 | 0 | 1.6 |
| 1ni9A0 | 1pkp_1 | RIBOSOMAL_S5 | 9 | 6 | 15.4 | 18.8 | 0.03 | 0 | 2.6 |
| 1lxnA0 | 1eg7A4 | FTHFS_1 | 9 | 6 | 13.5 | 3.4 | 0.02 | 0 | 2.1 |
| 1rtyA0 | 1cpcB2 | LIG_CYC | 8 | 6 | 0 | 3.3 | 0.06 | 0 | 0.9 |
| 1vhnA0 | 1rblA1 | LIG_CAP | 8 | 6 | 14.2 | 1.5 | 0.09 | 0 | 2.9 |
| 1rtyA0 | 2cmd_1 | MDH | 8 | 6 | 13.9 | 19.8 | 0.02 | 0 | 0.9 |
| 1vj1A0 | 1hdoA1 | LIG_NAP | 8 | 6 | 14.6 | 3.2 | 0.07 | 0 | 3.3 |
| 1nc5A0 | 1aorA1 | LIG_PTE | 8 | 6 | 14.6 | 0.5 | 0.01 | 0 | 0.5 |
| 1rtwA0 | 1ft1A2 | PPTA | 8 | 6 | 13.1 | 11.7 | 0.03 | 0 | 1.7 |
| 1pg6A0 | 1qs0A1 | LIG_TDP | 8 | 6 | 13.5 | 0.2 | 0.02 | 0 | 0.9 |
| 1vizA0 | 1ho4B1 | LIG_PXP | 8 | 6 | 13.9 | 0.2 | 0.02 | 0 | 4.4 |
| 1l5xA0 | 1knyA1 | LIG_APC | 8 | 5 | 0 | 9.7 | 0.03 | 0 | 1.5 |
| 1vh6B0 | 1b72B1 | HOMEOBOX_1 | 8 | 5 | 0 | 21.6 | 0.06 | 0.4 | 2.9 |
| 1mwqB0 | 19hcA1 | CYTOCHROME_C | 8 | 6 | 0 | 3.5 | 0.02 | 0 | 0.8 |
| 1s0uA0 | 1tplA1 | BETA_ELIM_LYASE | 8 | 6 | 15 | 0.6 | 0.02 | 0 | 2.6 |
| 1ixlA0 | 1ksaA1 | LIG_BCL | 8 | 6 | 0 | 2.2 | 0.04 | 0 | 1.9 |
| 1ufaA0 | 1nstA1 | LIG_A3P | 8 | 6 | 15.8 | 2.3 | 0.02 | 0 | 0.7 |
| 1rvkA0 | 2mnr_1 | LIG__MN | 8 | 6 | 14.2 | 39.5 | 0.05 | 9.3 | 9.8 |
| 1rvkA0 | 2mnr_1 | MR_MLE_2 | 8 | 6 | 14.2 | 39.5 | 0.05 | 9.3 | 9.8 |
| 1vh6A0 | 1rdzA2 | LIG_AMP | 8 | 6 | 13.1 | 1.7 | 0.02 | 0 | 1 |
| 1ns5A0 | 1qjbB4 | LIG_SEP | 8 | 5 | 0 | 0.8 | 0.02 | 0 | 1.6 |
| 1rtyA0 | 1bcfA1 | BACTERIOFERRITIN | 8 | 6 | 16.2 | 7.3 | 0.14 | 0 | 0.9 |
| 1vi3A0 | 1a44_2 | PBP | 8 | 6 | 38.9 | 31.7 | 0.24 | 1 | 4.7 |
| 1j74A0 | 1dat_1 | FERRITIN_1 | 8 | 6 | 15.8 | 5 | 0.00 | 0 | 0 |
| 1j7dA0 | 1dat_1 | FERRITIN_1 | 8 | 6 | 15.8 | 0.7 | 0.00 | 0 | 0 |
| 1pc6A0 | 1qq8A1 | HEME_OXYGENASE | 8 | 6 | 0 | 0.9 | 0.03 | 0 | 1 |
| 1htwA0 | 1a4sA1 | ALDEHYDE_DEHYDR_GLU | 7 | 6 | 15 | 1.7 | 0.04 | 0 | 2 |
| 1vhmA0 | 1f5mB1 | UPF0067 | 7 | 6 | 120 | 52.7 | 0.64 | 10 | 9.3 |
| 1vhmB0 | 1f5mB1 | UPF0067 | 7 | 6 | 121 | 53.3 | 0.63 | 11.6 | 10.1 |
| 1rvkA0 | 2mnr_4 | MR_MLE_1 | 7 | 6 | 14.2 | 39.5 | 0.05 | 9.3 | 9.8 |
| 1j6oA0 | 1e5dA4 | LIG_FEO | 7 | 6 | 14.2 | 0.3 | 0.03 | 0 | 0.1 |
| 1vhmA0 | 9icwA8 | DNA_POLYMERASE_X | 7 | 6 | 0 | 0.8 | 0.03 | 0 | 1.5 |
| 1qyiA0 | 2scpA1 | EF_HAND | 7 | 6 | 0 | 5.7 | 0.02 | 0 | 1.1 |
| 1nkvA0 | 1dhs_2 | LIG_NAD | 7 | 6 | 0 | 2 | 0.04 | 0 | 0.3 |
| 1nigA0 | 1c8zA1 | TUB_2 | 7 | 6 | 0 | 0.7 | 0.01 | 0 | 4 |
| 1gpqB0 | 1bif_1 | ATP_GTP_A | 7 | 6 | 0 | 2.5 | 0.03 | 0 | 1 |
| 1p9vA0 | 1cjcA1 | LIG_FAD | 7 | 6 | 14.2 | 3.5 | 0.01 | 0 | 0.6 |
| 1vhmA0 | 1cjcA1 | LIG_FAD | 7 | 6 | 14.6 | 0.5 | 0.02 | 0 | 1 |
| 1vhmB0 | 1cjcA1 | LIG_FAD | 7 | 6 | 14.6 | 0.7 | 0.01 | 0 | 0.2 |
| 1lqlA0 | 1i78A5 | OMPTIN_2 | 7 | 7 | 0 | 2.2 | 0.03 | 0 | 0 |

Declarations

Acknowledgements

The authors thank Gianni Cesareni and Arthur Lesk for helpful support and discussion. We gratefully acknowledge the support of Telethon GGP04273, GENEFUN, a PNR 2001–2003 (FIRB art.8) and a PNR 2003–2007 (FIRB art.8).

Authors’ Affiliations

(1)
Boston College, Biology Department
(2)
Centre for Molecular Bioinformatics, Department of Biology, University of Rome Tor Vergata

References

  1. Shapiro L, Harris T: Finding function through structural genomics. Curr Opin Biotechnol 2000, 11(1):31–35. 10.1016/S0958-1669(99)00064-6
  2. Whisstock JC, Lesk AM: Prediction of protein function from protein sequence and structure. Q Rev Biophys 2003, 36(3):307–340. 10.1017/S0033583503003901
  3. Fischer D, Norel R, Wolfson H, Nussinov R: Surface motifs by a computer vision technique: searches, detection, and implications for protein-ligand recognition. Proteins 1993, 16(3):278–292. 10.1002/prot.340160306
  4. Norel R, Fischer D, Wolfson HJ, Nussinov R: Molecular surface recognition by a computer vision-based technique. Protein Eng 1994, 7(1):39–46.
  5. Kauvar LM, Villar HO: Deciphering cryptic similarities in protein binding sites. Curr Opin Biotechnol 1998, 9(4):390–394. 10.1016/S0958-1669(98)80013-X
  6. Lesk AM, Fordham WD: Conservation and variability in the structures of serine proteinases of the chymotrypsin family. J Mol Biol 1996, 258(3):501–537. 10.1006/jmbi.1996.0264
  7. Fischer D, Wolfson H, Lin SL, Nussinov R: Three-dimensional, sequence order-independent structural comparison of a serine protease against the crystallographic database reveals active site similarities: potential implications to evolution and to protein folding. Protein Sci 1994, 3(5):769–778.
  8. Contreras JA, Karlsson M, Osterlund T, Laurell H, Svensson A, Holm C: Hormone-sensitive lipase is structurally related to acetylcholinesterase, bile salt-stimulated lipase, and several fungal lipases. Building of a three-dimensional model for the catalytic domain of hormone-sensitive lipase. J Biol Chem 1996, 271(49):31426–31430. 10.1074/jbc.271.49.31426
  9. Kobayashi N, Go N: ATP binding proteins with different folds share a common ATP-binding structural motif. Nat Struct Biol 1997, 4(1):6–7. 10.1038/nsb0197-6
  10. Via A, Ferre F, Brannetti B, Valencia A, Helmer-Citterich M: Three-dimensional view of the surface motif associated with the P-loop structure: cis and trans cases of convergent evolution. J Mol Biol 2000, 303(4):455–465. 10.1006/jmbi.2000.4151
  11. Hwang KY, Chung JH, Kim SH, Han YS, Cho Y: Structure-based identification of a novel NTPase from Methanococcus jannaschii. Nat Struct Biol 1999, 6(7):691–696. 10.1038/10745
  12. Wistow G, Piatigorsky J: Recruitment of enzymes as lens structural proteins. Science 1987, 236(4808):1554–1556.
  13. Holm L, Sander C: An evolutionary treasure: unification of a broad set of amidohydrolases related to urease. Proteins 1997, 28(1):72–82. 10.1002/(SICI)1097-0134(199705)28:1<72::AID-PROT7>3.0.CO;2-L
  14. Ganfornina MD, Sanchez D: Generation of evolutionary novelty by functional shift. Bioessays 1999, 21(5):432–439. 10.1002/(SICI)1521-1878(199905)21:5<432::AID-BIES10>3.0.CO;2-T
  15. Todd AE, Orengo CA, Thornton JM: Plasticity of enzyme active sites. Trends Biochem Sci 2002, 27(8):419–426. 10.1016/S0968-0004(02)02158-8
  16. Laskowski RA, Luscombe NM, Swindells MB, Thornton JM: Protein clefts in molecular recognition and function. Protein Sci 1996, 5(12):2438–2452.
  17. Kleywegt GJ: Recognition of spatial motifs in protein structures. J Mol Biol 1999, 285(4):1887–1897. 10.1006/jmbi.1998.2393
  18. Rosen M, Lin SL, Wolfson H, Nussinov R: Molecular shape comparisons in searches for active sites and functional similarity. Protein Eng 1998, 11(4):263–277. 10.1093/protein/11.4.263
  19. Preissner R, Goede A, Rother K, Osterkamp F, Koert U, Froemmel C: Matching organic libraries with protein-substructures. J Comput Aided Mol Des 2001, 15(9):811–817. 10.1023/A:1013158818807
  20. Kinoshita K, Nakamura H: Identification of protein biochemical functions by similarity search using the molecular surface database eF-site. Protein Sci 2003, 12(8):1589–1595. 10.1110/ps.0368703
  21. Schmitt S, Kuhn D, Klebe G: A new method to detect related function among proteins independent of sequence and fold homology. J Mol Biol 2002, 323(2):387–406. 10.1016/S0022-2836(02)00811-2
  22. Binkowski TA, Adamian L, Liang J: Inferring functional relationships of proteins from local sequence and spatial surface patterns. J Mol Biol 2003, 332(2):505–526. 10.1016/S0022-2836(03)00882-9
  23. Stark A, Sunyaev S, Russell RB: A model for statistical significance of local similarities in structure. J Mol Biol 2003, 326(5):1307–1316. 10.1016/S0022-2836(03)00045-7
  24. Stark A, Russell RB: Annotation in three dimensions. PINTS: Patterns in Non-homologous Tertiary Structures. Nucleic Acids Res 2003, 31(13):3341–3344. 10.1093/nar/gkg506
  25. Stark A, Shkumatov A, Russell RB: Finding functional sites in structural genomics proteins. Structure (Camb) 2004, 12(8):1405–1412. 10.1016/j.str.2004.05.012
  26. Singh R, Saha M: Identifying structural motifs in proteins. Pac Symp Biocomput 2003, 228–239.
  27. Chen BY, Fofanov VY, Kristensen DM, Kimmel M, Lichtarge O, Kavraki LE: Algorithms for structural comparison and statistical analysis of 3D protein motifs. Pac Symp Biocomput 2005, 334–345.
  28. Schmollinger M, Fischer I, Nerz C, Pinkenburg S, Gotz F, Kaufmann M, Lange KJ, Reuter R, Rosenstiel W, Zell A: ParSeq: searching motifs with structural and biochemical properties. Bioinformatics 2004, 20(9):1459–1461. 10.1093/bioinformatics/bth083
  29. Torrance JW, Bartlett GJ, Porter CT, Thornton JM: Using a library of structural templates to recognise catalytic sites and explore their evolution in homologous families. J Mol Biol 2005, 347(3):565–581. 10.1016/j.jmb.2005.01.044
  30. Wangikar PP, Tendulkar AV, Ramya S, Mali DN, Sarawagi S: Functional sites in protein families uncovered via an objective and automated graph theoretic approach. J Mol Biol 2003, 326(3):955–978. 10.1016/S0022-2836(02)01384-0
  31. Pal D, Eisenberg D: Inference of protein function from protein structure. Structure (Camb) 2005, 13(1):121–130. 10.1016/j.str.2004.10.015
  32. Ferre F, Ausiello G, Zanzoni A, Helmer-Citterich M: SURFACE: a database of protein surface regions for functional annotation. Nucleic Acids Res 2004, 32(Database):D240–244. 10.1093/nar/gkh054
  33. Falquet L, Pagni M, Bucher P, Hulo N, Sigrist CJ, Hofmann K, Bairoch A: The PROSITE database, its status in 2002. Nucleic Acids Res 2002, 30(1):235–238. 10.1093/nar/30.1.235
  34. Puntervoll P, Linding R, Gemund C, Chabanis-Davidson S, Mattingsdal M, Cameron S, Martin DM, Ausiello G, Brannetti B, Costantini A, et al.: ELM server: A new resource for investigating short functional sites in modular eukaryotic proteins. Nucleic Acids Res 2003, 31(13):3625–3630. 10.1093/nar/gkg545
  35. Camon E, Magrane M, Barrell D, Lee V, Dimmer E, Maslen J, Binns D, Harte N, Lopez R, Apweiler R: The Gene Ontology Annotation (GOA) Database: sharing knowledge in Uniprot with Gene Ontology. Nucleic Acids Res 2004, 32(Database):D262–266. 10.1093/nar/gkh021
  36. Alberts IL, Nadassy K, Wodak SJ: Analysis of zinc binding sites in protein crystal structures. Protein Sci 1998, 7(8):1700–1716.
  37. Tainer JA, Roberts VA, Getzoff ED: Protein metal-binding sites. Curr Opin Biotechnol 1992, 3(4):378–387. 10.1016/0958-1669(92)90166-G
  38. Barondeau DP, Getzoff ED: Structural insights into protein-metal ion partnerships. Curr Opin Struct Biol 2004, 14(6):765–774. 10.1016/j.sbi.2004.10.012
  39. Andreeva A, Howorth D, Brenner SE, Hubbard TJ, Chothia C, Murzin AG: SCOP database in 2004: refinements integrate structure and sequence family data. Nucleic Acids Res 2004, 32(Database):D226–229. 10.1093/nar/gkh039
  40. McCarthy AA, Baker HM, Shewry SC, Patchett ML, Baker EN: Crystal structure of methylmalonyl-coenzyme A epimerase from P. shermanii: a novel enzymatic function on an ancient metal binding scaffold. Structure (Camb) 2001, 9(7):637–646. 10.1016/S0969-2126(01)00622-0
  41. Fraaije MW, Mattevi A: Flavoenzymes: diverse catalysts with recurrent features. Trends Biochem Sci 2000, 25(3):126–132. 10.1016/S0968-0004(99)01533-9
  42. Dym O, Eisenberg D: Sequence-structure analysis of FAD-containing proteins. Protein Sci 2001, 10(9):1712–1728. 10.1110/ps.12801
  43. Tatusova TA, Madden TL: BLAST 2 Sequences, a new tool for comparing protein and nucleotide sequences. FEMS Microbiol Lett 1999, 174(2):247–250. 10.1016/S0378-1097(99)00149-4
  44. Krissinel E, Henrick K: Protein structure comparison in 3D based on secondary structure matching (SSM) followed by Cα alignment, scored by a new structural similarity function. Proceedings of the 5th International Conference on Molecular Structural Biology, Vienna, September 3–7 2003, 88.
  45. Teichmann SA, Murzin AG, Chothia C: Determination of protein function, evolution and interactions by structural genomics. Curr Opin Struct Biol 2001, 11(3):354–363. 10.1016/S0959-440X(00)00215-3
  46. Chance MR, Bresnick AR, Burley SK, Jiang JS, Lima CD, Sali A, Almo SC, Bonanno JB, Buglino JA, Boulton S, et al.: Structural genomics: a pipeline for providing structures for the biologist. Protein Sci 2002, 11(4):723–738. 10.1110/ps.4570102
  47. Todd AE, Orengo CA, Thornton JM: Evolution of protein function, from a structural perspective. Curr Opin Chem Biol 1999, 3(5):548–556. 10.1016/S1367-5931(99)00007-1
  48. Thornton JM, Todd AE, Milburn D, Borkakoti N, Orengo CA: From structure to function: approaches and limitations. Nat Struct Biol 2000, 7(Suppl):991–994. 10.1038/80784
  49. Irving JA, Whisstock JC, Lesk AM: Protein structural alignments and functional genomics. Proteins 2001, 42(3):378–382. 10.1002/1097-0134(20010215)42:3<378::AID-PROT70>3.0.CO;2-3
  50. Ausiello G, Zanzoni A, Peluso D, Via A, Helmer-Citterich M: pdbFun: mass selection and fast comparison of annotated PDB residues. Nucleic Acids Research 2005, 33(Web server issue):W133–7. 10.1093/nar/gki499
  51. Laskowski RA: SURFNET: a program for visualizing molecular surfaces, cavities, and intermolecular interactions. J Mol Graph 1995, 13(5):323–330, 307–328. 10.1016/0263-7855(95)00073-9
  52. Berman HM, Battistuz T, Bhat TN, Bluhm WF, Bourne PE, Burkhardt K, Feng Z, Gilliland GL, Iype L, Jain S, et al.: The Protein Data Bank. Acta Crystallogr D Biol Crystallogr 2002, 58(Pt 6 No 1):899–907. 10.1107/S0907444902003451
  53. Gattiker A, Bienvenut WV, Bairoch A, Gasteiger E: FindPept, a tool to identify unmatched masses in peptide mass fingerprinting protein identification. Proteomics 2002, 2(10):1435–1444. 10.1002/1615-9861(200210)2:10<1435::AID-PROT1435>3.0.CO;2-9
  54. Hubbard S, Thornton JM: NACCESS, Computer Program. In Department of Biochemistry and Molecular Biology. University College London; 1993.
  55. Hubbard SJ, Campbell SF, Thornton JM: Molecular recognition. Conformational analysis of limited proteolytic sites and serine proteinase protein inhibitors. J Mol Biol 1991, 220(2):507–530. 10.1016/0022-2836(91)90027-4
  56. Coutsias EA, Seok C, Dill KA: Using quaternions to calculate RMSD. J Comput Chem 2004, 25(15):1849–1857. 10.1002/jcc.20110
  57. Schwartz R, Dayhoff M: Matrices for detecting distant relationships. Foundation NBR. Washington DC; 1979.
  58. Boeckmann B, Bairoch A, Apweiler R, Blatter MC, Estreicher A, Gasteiger E, Martin MJ, Michoud K, O'Donovan C, Phan I, et al.: The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res 2003, 31(1):365–370. 10.1093/nar/gkg095
  59. Rice P, Longden I, Bleasby A: EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet 2000, 16(6):276–277. 10.1016/S0168-9525(00)02024-2

Copyright

© Ferrè et al; licensee BioMed Central Ltd. 2005

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.