PDB-UF: database of predicted enzymatic functions for unannotated protein structures from structural genomics
© von Grotthuss et al; licensee BioMed Central Ltd. 2006
- Received: 06 August 2005
- Accepted: 06 February 2006
- Published: 06 February 2006
The number of protein structures from structural genomics centers in the Protein Data Bank (PDB) is increasing dramatically. Many of these structures are functionally unannotated because they have no sequence similarity to proteins of known function. However, it is possible to infer function using structural similarity alone.
Here we present the PDB-UF database, a web-accessible collection of enzymatic function predictions based on the structure-function relationship. The assignments were made for three-dimensional protein structures of unknown function that come from structural genomics initiatives. We show that four hypothetical proteins (PDB accession codes 1VH0, 1NS5, 1O6D, and 1TO0), for which standard BLAST tools such as PSI-BLAST or RPS-BLAST failed to assign any function, are probably methyltransferases.
We suggest that structure-based prediction of an EC number should use a different similarity score cutoff for each protein fold. Moreover, performing the annotation with two different algorithms can reduce the rate of false positive assignments. We believe that the presented web-based repository will help to decrease the number of protein structures whose function is marked as "unknown" in the PDB file.
- Protein Data Bank
- Protein Data Bank File
- Enzymatic Classification
- Protein Data Bank Chain
- Structural Genomic Initiative
Over 30 structural genomics centers have been established worldwide with the common goal of large-scale, high-throughput structure determination using X-ray crystallography and NMR. One challenge is to predict the function of proteins from their three-dimensional structures, primarily those that have no detectable sequence similarity to any protein of known function. The Protein Data Bank (PDB) currently holds more than 32,000 entries, which contain over 29,000 different (63,000 redundant) protein chains. Many of the PDB chains have been mapped to Enzymatic Classification (EC) numbers via the Swiss-Prot database. This mapping is available on the Internet as the PDBSprotEC database. SCOPEC is another web-based repository, similar to the PDBSprotEC collection. The SCOPEC set contains a description of protein catalytic domains with assigned enzyme function. In both databases, protein function was assigned using sequence similarity. There is no doubt that the PDBSprotEC and SCOPEC databases are full of very useful EC number assignments. However, neither service contains predictions for proteins that have no sequence similarity to known enzymes. Moreover, neither PDBSprotEC nor SCOPEC includes any data for recently deposited PDB structures. The most recent protein annotated in PDBSprotEC was released by the PDB in August 2004, and the most recent in SCOPEC in February 2003. Therefore, we decided to use the structure-function relationship [7–9] for automatic assignment of EC numbers to 499 protein structures that came from the structural genomics centers and whose function is marked as "unknown" in the PDB file. All assignments are combined into a web-accessible database, which will be updated as soon as new structures from structural genomics projects are released.
Because most of these PDB entries have not yet been published, we believe that our repository will help to reduce the number of proteins whose function is marked as "unknown" in the PDB file.
Construction and content
Two different strategies, 3D-Hit and 3D-Fun, were applied to annotate the proteins with EC numbers. The first strategy uses the 3D-Hit program to scan a sequentially non-redundant database of structures, each of which is characterized by four cutoff values. Each value is defined by the highest known score of structural similarity to any protein with a different enzyme function at the corresponding or lower EC level. In the 3D-Hit strategy, the EC number of the protein with the strongest structural similarity is completely (or partially) assigned to the query if the similarity score is greater than all (or some) of the cutoff values. As an example, consider a query protein that has a 3D-Hit score of 150 to an enzyme whose EC number begins with 1.2 and whose cutoff values for the four EC levels are 100, 120, 180, and 200, respectively. Because the score exceeds only the first two cutoffs, this structure will obtain an EC number assignment of 1.2.?.?.
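The cutoff rule in the worked example can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the 3D-Hit implementation: the function name `assign_ec_by_cutoffs` is hypothetical, the hit's EC number `1.2.3.4` is a placeholder whose last two levels are invented for the example, and the sketch assumes that once one level fails its cutoff all deeper levels are also left unassigned.

```python
def assign_ec_by_cutoffs(hit_ec, score, cutoffs):
    """Transfer the best hit's EC number level by level.

    A level is kept only if the similarity score exceeds the cutoff for that
    level and for every level above it (an assumption of this sketch).
    """
    levels = hit_ec.split(".")
    if len(levels) != 4 or len(cutoffs) != 4:
        raise ValueError("expected four EC levels and four cutoff values")

    assigned = []
    trusted = True
    for level, cutoff in zip(levels, cutoffs):
        trusted = trusted and score > cutoff
        assigned.append(level if trusted else "?")
    return ".".join(assigned)


# Values from the text: score 150, cutoffs 100, 120, 180, 200.
# "1.2.3.4" is a placeholder EC number for the best hit.
example = assign_ec_by_cutoffs("1.2.3.4", 150, [100, 120, 180, 200])
print(example)
```

With the values from the text, the score passes the first two cutoffs only, so the query receives the partial number 1.2.?.?.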
All structural similarity scores are used for annotation in the 3D-Fun strategy. First, the query structure and all sequentially non-redundant proteins are hierarchically clustered (grouped) by structural similarity using the complete-link algorithm [13, 14]. Next, in each clustering iteration, the EC number is completely (or partially) assigned to each group if all of the enzymes in the group have the same function at all (or some) of the EC levels; otherwise the EC number is assigned as unknown. As an example, consider a cluster that contains four structures: the query protein and three enzymes whose EC numbers share only their first two levels, 1.2. This cluster will obtain an EC number assignment of 1.2.?.?. For the final prediction, the enzymatic function of the smallest cluster that contains the query structure is used. In contrast to the 3D-Hit strategy, the 3D-Fun algorithm takes into account the enzymatic function of all structures that are more similar to the query than to all other proteins of the whole set.
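A toy version of this procedure is sketched below in Python. It is not the 3D-Fun code: the function names, the structure labels, the EC numbers, and the similarity scores are all invented for illustration, and the naive agglomeration loop stands in for the efficient complete-link algorithms cited above. The sketch merges the two clusters with the highest complete-link similarity (the weakest pairwise score between their members) until the query is absorbed, then reports the EC levels shared by the enzymes in that cluster.

```python
from itertools import combinations


def consensus_ec(ec_numbers):
    """Return the EC levels shared by every enzyme; the rest become '?'."""
    if not ec_numbers:
        return "?.?.?.?"
    agree = True
    consensus = []
    for values in zip(*(ec.split(".") for ec in ec_numbers)):
        agree = agree and len(set(values)) == 1
        consensus.append(values[0] if agree else "?")
    return ".".join(consensus)


def smallest_query_cluster(names, similarity, query):
    """Naive complete-link agglomeration, stopped at the first merge
    that absorbs the query structure."""
    clusters = [frozenset([name]) for name in names]

    def link(a, b):
        # Complete link: two clusters are as similar as their least similar pair.
        return min(similarity[frozenset((x, y))] for x in a for y in b)

    while len(clusters) > 1:
        a, b = max(combinations(clusters, 2), key=lambda pair: link(*pair))
        merged = a | b
        clusters = [c for c in clusters if c not in (a, b)] + [merged]
        if query in merged:
            return merged
    return frozenset([query])


# Hypothetical data: three related enzymes, one unrelated enzyme, one query.
ec_of = {"A": "1.2.3.4", "B": "1.2.3.5", "C": "1.2.4.6", "D": "3.1.1.1"}
scores = {
    ("A", "B"): 300, ("A", "C"): 280, ("B", "C"): 270,
    ("query", "A"): 200, ("query", "B"): 190, ("query", "C"): 180,
    ("D", "A"): 50, ("D", "B"): 50, ("D", "C"): 50, ("query", "D"): 50,
}
similarity = {frozenset(pair): score for pair, score in scores.items()}

cluster = smallest_query_cluster(["query", "A", "B", "C", "D"], similarity, "query")
prediction = consensus_ec([ec_of[name] for name in cluster if name in ec_of])
print(sorted(cluster), prediction)
```

In this toy data set the query joins the group of three enzymes that agree only at the first two EC levels, so the prediction is 1.2.?.?, matching the example in the text.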
We used both algorithms to infer the EC number for the 499 proteins from structural genomics that are currently available and have unknown functions. In order to avoid over-annotation due to partial EC numbers, we followed the recommendation of Green and Karp. If the 3D-Hit and 3D-Fun methods were inconsistent in predicting enzyme function at any EC level, this was indicated with a '?' symbol in the corresponding position (e.g. 2.3.4.?). If the assignments were fully consistent, we indicated this with an 'n' in the fourth EC level (e.g. 2.3.4.n), which means that the exact activity of the enzyme was predicted, but a serial number has not yet been assigned by the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology (NC-IUBMB).
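The way the two predictions are merged can be illustrated with a short Python sketch. The function name `combine_predictions` is hypothetical, and two details are assumptions of this sketch rather than statements from the text: a '?' in either prediction is treated as a disagreement, and all levels below the first disagreement are also marked '?'.

```python
def combine_predictions(hit_ec, fun_ec):
    """Merge the 3D-Hit and 3D-Fun predictions level by level.

    Levels on which both methods agree are kept; from the first disagreement
    (or unknown level) onward the result is '?'. If all four levels agree,
    the fourth level is written as 'n', as described in the text.
    """
    combined = []
    agree = True
    for a, b in zip(hit_ec.split("."), fun_ec.split(".")):
        agree = agree and a == b and a != "?"
        combined.append(a if agree else "?")
    if agree and len(combined) == 4:
        combined[3] = "n"
    return ".".join(combined)


print(combine_predictions("2.3.4.5", "2.3.4.5"))  # fully consistent
print(combine_predictions("2.3.4.5", "2.3.4.7"))  # differ at the fourth level
```

Requiring agreement between two independent methods trades coverage for precision: a level is reported only when both strategies support it.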
Utility and discussion
Structural genomics initiatives tend to target structures that are less typical of the PDB as a whole, so the cutoffs derived from the whole PDB may not be entirely applicable. Therefore, we analyzed 58 structures with predicted EC numbers that had recently been published and functionally annotated, since this may give a truer indication of the accuracy. We found only one further incorrect prediction beyond what is described above: 1VGY had been characterized as a succinyl-diaminopimelate desuccinylase (3.5.1.?), whereas a metallocarboxypeptidase function (3.4.17.n) was assigned. All such predictions will be manually corrected. However, as more structures are deposited in the Protein Data Bank, we expect the PDB-UF method to become increasingly accurate, so that human intervention will no longer be required.
Example of PDB-UF record
The PDB-UF database is a collection of EC numbers assigned to protein structures of unknown function that come from the structural genomics centers. Structure-based prediction of the EC number was conducted using different cutoff values for different protein folds. In order to reduce the number of false positives, the annotation was performed using the Meta-strategy. The web-based repository will be updated automatically when new protein structures are released.
We are indebted to Gert Vriend for his critical reading of the manuscript. MvG would like to thank the Foundation for Polish Science for the fellowship. The work was supported by 6FP GeneFun (LSHG-CT-2004-503567) and DataGenome (LSHB-CT-2003-503017) grants and by the Polish Ministry of Science and Information.
- Chen L, Oughtred R, Berman HM, Westbrook J: TargetDB: a target registration database for structural genomics projects. Bioinformatics 2004, 20: 2860–2862. 10.1093/bioinformatics/bth300
- Stawiski EW, Gregoret LM, Mandel-Gutfreund Y: Annotating nucleic acid-binding function based on protein structure. J Mol Biol 2003, 326: 1065–1079. 10.1016/S0022-2836(03)00031-7
- Sussman JL, Lin D, Jiang J, Manning NO, Prilusky J, Ritter O, Abola EE: Protein Data Bank (PDB): database of three-dimensional structural information of biological macromolecules. Acta Crystallogr D Biol Crystallogr 1998, 54: 1078–1084. 10.1107/S0907444998009378
- Bairoch A, Boeckmann B: The SWISS-PROT protein sequence data bank. Nucleic Acids Res 1991, 19 Suppl: 2247–2249.
- Martin AC: PDBSprotEC: a Web-accessible database linking PDB chains to EC numbers via SwissProt. Bioinformatics 2004, 20: 986–988. 10.1093/bioinformatics/bth048
- George RA, Spriggs RV, Thornton JM, Al-Lazikani B, Swindells MB: SCOPEC: a database of protein catalytic domains. Bioinformatics 2004, 20 Suppl 1: I130-I136. 10.1093/bioinformatics/bth948
- Shakhnovich BE, Dokholyan NV, DeLisi C, Shakhnovich EI: Functional fingerprints of folds: evidence for correlated structure-function evolution. J Mol Biol 2003, 326: 1–9. 10.1016/S0022-2836(02)01362-1
- Pal D, Eisenberg D: Inference of protein function from protein structure. Structure 2005, 13: 121–130. 10.1016/j.str.2004.10.015
- Laskowski RA, Watson JD, Thornton JM: ProFunc: a server for predicting protein function from 3D structure. Nucleic Acids Res 2005, 33: W89–93. 10.1093/nar/gki414
- Todd AE, Orengo CA, Thornton JM: Evolution of function in protein superfamilies, from a structural perspective. J Mol Biol 2001, 307: 1113–1143. 10.1006/jmbi.2001.4513
- Rost B: Enzyme function less conserved than anticipated. J Mol Biol 2002, 318: 595–608. 10.1016/S0022-2836(02)00016-5
- Plewczynski D, Pas J, von Grotthuss M, Rychlewski L: 3D-Hit: fast structural comparison of proteins. Appl Bioinformatics 2002, 1: 223–225.
- Defays D: An Efficient Algorithm for a Complete Link Method. The Computer Journal 1977, 20: 364–366. 10.1093/comjnl/20.4.364
- Murtagh F: A survey of recent advances in hierarchical clustering algorithms. The Computer Journal 1983, 26: 354–359.
- Green ML, Karp PD: Genome annotation errors in pathway databases due to semantic ambiguity in partial EC numbers. Nucleic Acids Res 2005, 33: 4035–4039. 10.1093/nar/gki711
- Holm L, Sander C: Mapping the protein universe. Science 1996, 273: 595–603.
- Ginalski K, von Grotthuss M, Grishin NV, Rychlewski L: Detecting distant homology with Meta-BASIC. Nucleic Acids Res 2004, 32: W576–81.
- Forouhar F, Yang Y, Kumar D, Chen Y, Fridman E, Park SW, Chiang Y, Acton TB, Montelione GT, Pichersky E, Klessig DF, Tong L: Structural and biochemical studies identify tobacco SABP2 as a methyl salicylate esterase and implicate it in plant innate immunity. Proc Natl Acad Sci U S A 2005, 102: 1773–1778. 10.1073/pnas.0409227102
- Badger J, Sauder JM, Adams JM, Antonysamy S, Bain K, Bergseid MG, Buchanan SG, Buchanan MD, Batiyenko Y, Christopher JA, Emtage S, Eroshkina A, Feil I, Furlong EB, Gajiwala KS, Gao X, He D, Hendle J, Huber A, Hoda K, Kearins P, Kissinger C, Laubert B, Lewis HA, Lin J, Loomis K, Lorimer D, Louie G, Maletic M, Marsh CD, Miller I, Molinari J, Muller-Dieckmann HJ, Newman JM, Noland BW, Pagarigan B, Park F, Peat TS, Post KW, Radojicic S, Ramos A, Romero R, Rutter ME, Sanderson WE, Schwinn KD, Tresser J, Winhoven J, Wright TA, Wu L, Xu J, Harris TJ: Structural analysis of a set of proteins resulting from a bacterial genomics project. Proteins 2005, 60: 787–796. 10.1002/prot.20541
- Marchler-Bauer A, Anderson JB, DeWeese-Scott C, Fedorova ND, Geer LY, He S, Hurwitz DI, Jackson JD, Jacobs AR, Lanczycki CJ, Liebert CA, Liu C, Madej T, Marchler GH, Mazumder R, Nikolskaya AN, Panchenko AR, Rao BS, Shoemaker BA, Simonyan V, Song JS, Thiessen PA, Vasudevan S, Wang Y, Yamashita RA, Yin JJ, Bryant SH: CDD: a curated Entrez database of conserved domain alignments. Nucleic Acids Res 2003, 31: 383–387. 10.1093/nar/gkg087
- Elkins PA, Watts JM, Zalacain M, van Thiel A, Vitazka PR, Redlak M, Andraos-Selim C, Rastinejad F, Holmes WM: Insights into catalysis by a knotted TrmD tRNA methyltransferase. J Mol Biol 2003, 333: 931–949. 10.1016/j.jmb.2003.09.011
- Ahn HJ, Kim HW, Yoon HJ, Lee BI, Suh SW, Yang JK: Crystal structure of tRNA(m1G37)methyltransferase: insights into tRNA recognition. Embo J 2003, 22: 2593–2603. 10.1093/emboj/cdg269
- Anantharaman V, Koonin EV, Aravind L: SPOUT: a class of methyltransferases that includes spoU and trmD RNA methylase superfamilies, and novel superfamilies of predicted prokaryotic RNA methylases. J Mol Microbiol Biotechnol 2002, 4: 71–75.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.