- Methodology article
- Open Access
CMASA: an accurate algorithm for detecting local protein structural similarity and its application to enzyme catalytic site annotation
© Li and Huang; licensee BioMed Central Ltd. 2010
Received: 26 November 2009
Accepted: 27 August 2010
Published: 27 August 2010
The rapid development of structural genomics has resulted in many proteins of unknown function being deposited in the Protein Data Bank (PDB); the functional prediction of these proteins has therefore become a challenge for structural bioinformatics. Several sequence-based and structure-based methods have been developed to predict protein function, but their accuracy, sensitivity and computational speed still need to be improved. Here, an accurate algorithm, CMASA (Contact MAtrix based local Structural Alignment algorithm), has been developed to predict unknown protein functions based on local protein structural similarity. The algorithm has been evaluated on a test set of 164 enzyme families and compared with other methods.
The evaluation shows that CMASA is highly accurate (0.96), sensitive (0.86), and fast enough to be used in large-scale functional annotation. Compared with both sequence-based and global structure-based methods, CMASA can find not only remote homologous proteins but also cases of active site convergence. Compared with other local structure comparison-based methods, CMASA performs better than both FFF (a method using geometry to predict protein function) and SPASM (a local structure alignment method); it is more sensitive than PINTS and more accurate than JESS (both local structure alignment methods). CMASA was applied to annotate the enzyme catalytic sites of the non-redundant PDB, and at least 166 putative catalytic sites that are not annotated in the Catalytic Site Atlas (CSA) have been suggested.
CMASA is an accurate algorithm for detecting local protein structural similarity, and it holds several advantages in predicting enzyme active sites. CMASA can be used in large-scale enzyme active site annotation. CMASA is available through a mail-based server (http://126.96.36.199/other1/CMASA/CMASA.htm).
With the development of both genome projects and structural genomics, a large number of protein structures of unknown function have been deposited in the PDB, and the functions of these proteins need to be annotated. In addition, because of the fast development of bioinformatics, some proteins of known structure and function may also need to be re-annotated. Thus, several methods for predicting protein structure and function have been developed, which can be classified as sequence-based and structure-based methods.
Sequence-based methods, such as BLAST/PSI-BLAST [1, 2] or PROSITE, are based on the concept that similar protein sequences have similar functions. The performance of these methods depends critically on the sequence similarity between the query structure and the annotated structure. However, these methods may fail to detect remote homologues and convergent proteins. In addition, changes in a few key residues may change the protein function even when the sequence identity is very high. For example, VRK3, a member of the kinase family, has lost its kinase function and instead regulates the activity of other kinases, because key ATP-binding sites are mutated. Thus, sequence-based methods may also fail to annotate functionally diversified proteins.
Structure-based methods comprise global and local structure comparison methods. Although global structure-based methods, such as DALI, VAST, SSM and CE, can detect remote homologues, they fail to detect functional convergence between proteins with different folds. For example, the trypsins and subtilisins have different folds but share the same hydrolytic function, and global structure comparison cannot relate them to each other. Conversely, some proteins with similar structures perform different functions, and global structure-based methods cannot detect this functional divergence among proteins with the same fold.
Local structure-based methods can detect functional convergence and predict functional sites for proteins with few annotated structures; examples include FFF, PINTS, SPASM, JESS, Query3d, ASSAM, Cavbase and eF-site. FFF searches for local structural similarities using local geometric characteristics and a user-predefined contact matrix constraint, and has been successfully applied to predicting the active sites of glutaredoxins/thioredoxins and T1 ribonucleases. Other methods search for local structural similarities by a recursive enumeration strategy, the core of which is to extend initial candidate solutions [14, 19]. The performance of these methods therefore depends critically on constraints that allow partial candidate solutions to be extended quickly and accurately. The Max Inter-Distance Deviation (MIDD) is used as this constraint in most local structure comparison algorithms [12, 13], and the results are sorted by the RMSD or an RMSD-based P-value [12–14, 20]. However, there is no strict relationship between the MIDD and the RMSD: the MIDD of a match may be large while its RMSD is very small. To obtain better sensitivity and accuracy, users have to define a larger MIDD, but the CPU time then increases dramatically. Thus, it is very difficult to balance time cost and performance.
To improve on the local structure-based methods, CMASA (Contact MAtrix based local Structural Alignment algorithm) has been developed. The design requirements of CMASA are as follows: (i) it should be not only fast but also sensitive and accurate enough for large-scale structural annotation; (ii) it should be flexible enough for different applications.
Three CMASA databases have been generated for different applications: 1) nrCSA, derived from the Catalytic Site Atlas (CSA), for predicting enzyme active sites; 2) the nrPDB and nrSCOP databases, for detecting remote homologues and convergent cases.
Overview of CMASA
CMASA detects local structural similarity and supports different applications depending on the database used. 1) Putative function prediction for structured proteins using an active site database (currently only the nrCSA database is available). For example, the structure of 2qjw has been deposited by the Joint Center for Structural Genomics (JCSG), but its function is still unknown. With PSI-BLAST, 2qjw hits nothing in the PDB database and only tens of hypothetical proteins in the non-redundant (NR) database (p < 0.0005), which suggests that 2qjw is a protein of unknown function. With CMASA, 2qjw hits 1a88, a haloperoxidase, with a P-value of 6.7 × 10⁻¹⁰. The active sites of 2qjw predicted by CMASA are S81, D129 and H155, which match those of 1a88. Thus, 2qjw may have haloperoxidase function. 2) Detection of the same catalytic sites in non-homologous proteins arising from convergent evolution, using the nrPDB or nrSCOP. For example, the catalytic sites of 1djz (EC number: EC188.8.131.52) are H311, E431 and H356. With CMASA, these sites hit 2ddr (P-value = 0.02; EC number: EC184.108.40.206), with H311, E431 and H356 in 1djz corresponding to H296, D253 and H151 in 2ddr. These results suggest that 1djz and 2ddr carry out the same chemical transformation. Indeed, both H296 and D253 are catalytic residues in 2ddr, although the two proteins have different folds: 1djz belongs to the TIM beta/alpha-barrel fold, whereas 2ddr belongs to the DNase I-like fold. Thus, the shared catalytic mechanism of 1djz and 2ddr may have resulted from convergent evolution.
CMASA is also fast. On a personal computer with an Intel Core 2 Duo E8400 3.0 GHz CPU, searching the nrCSA database (1320 templates) with a 400-residue protein as the seed takes about 6 s with the default settings. Searching the nrSCOP (14541 structures) with an active site of 3 residues takes about 30–60 s with the default settings. The CMASA mail-based server (http://220.127.116.11/other1/CMASA/CMASA.htm) replies by e-mail with the search results within 3 minutes if the server is not too busy.
Constraint analysis of CMASA
Sensitivity and Accuracy of CMASA
164 CSA families were selected to test the overall sensitivity and accuracy of CMASA (Additional file 1). For each family, two different templates were used to search the training set, which contains both the structures of the family members and a constant negative dataset. One is the master template, and the other is the mean conformational template (MCT). For each template, the negative dataset is a subset of the nrPDB (10582 structures) from which the nrCSA and nrEC have been excluded.
The table of mean MCC, sensitivity and accuracy
Mean conformational template
Comparison of CMASA with both sequence-based and global structure-based methods
Comparison between CMASA and other local structure comparing methods
Several local structure comparison methods have been applied to enzyme active site annotation, such as FFF, SPASM, PINTS, Query3d and JESS. These five methods use different residue representations: for example, FFF uses only the Ca atom, JESS uses both Ca and Cb (beta carbon) atoms, and SPASM uses both Ca and a pseudo-atom at the geometric centre of the residue. They also use different searchable databases, so an overall comparison is difficult. We therefore used examples to evaluate the advantages and disadvantages of these five methods relative to CMASA.
PINTS has its own website, so we used the active sites of 2ity (a protein kinase) to search the PINTS SCOP_specials database (a PINTS database equivalent to the nrSCOP database in this work). Interestingly, only 1 kinase was hit, and the best hit was not a kinase. In contrast, CMASA hit 20 kinases with a false positive rate < 0.3. We also used the active sites of 1mct to search the PINTS SCOP_specials database (SCOP version 1.61): 4 of 54 trypsins and 4 of 10 subtilisins were hit, whereas CMASA hit 85 of 101 trypsins and 15 of 21 subtilisins. Thus, the sensitivity of CMASA is better than that of PINTS.
Query3d is powerful at finding similar local structures between two proteins, but it may be weak at detecting the similarity between an active site and a protein. For example, when we used the active sites of 1k2p (a protein kinase) to search the PDB database on the Query3d website, no hit was returned. In contrast, CMASA gave more than 20 positives (P-value < 0.01) with a reasonable ranking. Thus, Query3d may not be suitable for enzyme catalytic site annotation.
JESS was evaluated on CSA families, so the overall performance of CMASA and JESS could be compared. JESS was reported to have a maximum mean MCC of 0.83, with a mean sensitivity of 0.86 and a mean accuracy of 0.84. CMASA achieves a mean MCC of 0.90, with a mean sensitivity of 0.86 and a mean accuracy of 0.96. Thus, the accuracy of CMASA is higher than that of JESS (Figure 7C).
Large scale annotation of enzyme catalytic sites
The above results suggest that the sensitivity and accuracy of CMASA may be sufficient for large-scale functional annotation, so CMASA was also used to annotate enzyme catalytic sites. All the proteins in the nrPDB were searched against the mean conformational template (MCT) dataset (1320 templates) with CMASA, and 263 structures that are not annotated by CSA 2.2.9 were characterized (P-value < 1.0 × 10⁻⁴). Of these 263 structures, 166 were deposited before 2008 (Additional file 3). Thus, these results suggest at least 166 putative novel catalytic sites that are not annotated by the CSA (Additional file 3).
Cases of enzyme catalytic site prediction
Two cases were used to further evaluate the advantages of CMASA. The structure of 3BDV comes from the Joint Center for Structural Genomics (JCSG), which aims to develop high-throughput methods for protein production, crystallization, and structure determination. 3BDV belongs to the UDF1234 family, whose function is unknown. With CMASA, 3BDV hits 1JKM, a serine hydrolase (P-value = 3.40 × 10⁻⁵), and the catalytic residues of 3BDV are predicted to be S81, D135 and H162. Further sequence analysis shows that S81, D135 and H162 are conserved across all members of the UDF1234 family (Additional file 4), which suggests that the members of this family probably have a function similar to that of a serine hydrolase with an S-D-H active site.
The catalytic sites of an arylsulfatase (PDB id: 1HDH) have been annotated in the CSA. However, the catalytic sites of one of its homologues (PDB id: 1P49) cannot be found in the CSA. The PSI-BLAST result suggests that the catalytic sites of 1P49 fail to be annotated because of mismatches caused by the low sequence identity (Additional file 5). However, with CMASA, 1P49 hits 1HDH with high confidence (P-value = 3.5 × 10⁻¹³), and the catalytic residues are predicted to be R79, K134, H136, H290, D342 and K368 (Additional file 5). Structural information supports this prediction.
An accurate algorithm, CMASA, has been developed to detect local protein structural similarity. It can not only search for proteins of similar function by querying with active sites, but also predict the function of an unknown protein, including distant homologues and convergent proteins, by searching against a functional active site database.
When the Contact Matrix Average Deviation (CMAD) is used as the constraint and the Ca/Fa atoms are used to represent the residues, a balance between sensitivity, accuracy and time cost can be reached. CMASA is fast when tested on a PC, and remains sensitive and highly accurate (>0.94) in searching for enzyme active sites. CMASA may therefore be helpful for improving large-scale annotation.
CMASA has been compared with other methods: sequence-based methods, global structure-based methods and five local structure-based methods. The results suggest that CMASA performs better than all of these methods in detecting enzyme active site similarity. PSI-BLAST has been used to annotate enzyme catalytic sites, but it is weak at annotating distant homologues and convergent proteins. CMASA is therefore an effective method for annotating the active sites of distantly homologous or convergent proteins/enzymes.
Of course, CMASA has some limitations: for example, i) protein structures are required; ii) structural differences between the side chains of the query and hit active residues affect the sensitivity.
CMASA is not only highly accurate but also sensitive and fast in detecting local protein structural similarity. It can be applied to annotating the active sites of distantly homologous or convergent proteins/enzymes. At least 166 putative novel catalytic sites have been suggested by CMASA. A mail-based server is available.
To ensure accuracy and reduce complexity, CMASA uses all amino acids of the structures, and each residue is represented by both Ca (the alpha-carbon atom) and Fa (the atom furthest from Ca). In addition, Ca alone and Fa alone are also used, to evaluate how much the combination of the two improves predictive performance.
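The Ca/Fa representation described above can be sketched in a few lines of Python. This is an illustration only, not the authors' implementation. The function name `ca_and_fa` and the input layout (a dictionary of PDB atom names to coordinates) are hypothetical. The sketch also assumes that backbone atoms are excluded when searching for Fa, so that glycine contributes a Ca atom only; this is inferred from the remark in the nrCSA description that a template with two glycines and one other residue has 1 Fa and 3 Ca atoms.

```python
import math

# Assumption: backbone atoms are not candidates for Fa, so glycine has no Fa.
BACKBONE = {"N", "CA", "C", "O", "OXT"}

def ca_and_fa(atoms):
    """Return (Ca, Fa) coordinates for one residue.

    atoms: dict mapping PDB atom names to (x, y, z) tuples.
    Fa is the side-chain atom furthest from Ca, or None if the
    residue has no side-chain atoms (e.g. glycine).
    """
    ca = atoms["CA"]
    side_chain = {name: xyz for name, xyz in atoms.items() if name not in BACKBONE}
    if not side_chain:
        return ca, None
    fa_name = max(side_chain, key=lambda name: math.dist(ca, side_chain[name]))
    return ca, side_chain[fa_name]

# Illustrative (not experimental) coordinates for a serine-like residue.
serine = {
    "N": (-1.2, 0.8, 0.0), "CA": (0.0, 0.0, 0.0), "C": (1.2, 0.9, 0.0),
    "O": (2.0, 1.5, 1.0), "CB": (0.0, -1.5, 0.0), "OG": (0.5, -2.4, 0.8),
}
glycine = {"N": (-1.2, 0.8, 0.0), "CA": (0.0, 0.0, 0.0),
           "C": (1.2, 0.9, 0.0), "O": (2.0, 1.5, 1.0)}
```

Representing each residue by two points keeps the number of coordinates per template small while still encoding the orientation of the side chain.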
The flowchart of CMASA is shown in Figure 1. First, CMASA parses the query and decides whether it should search the nrPDB/nrSCOP or the nrCSA database. Second, CMASA uses a substitution matrix to enumerate all candidate matches. Third, CMASA uses the Contact Matrix Average Deviation (CMAD) to filter the candidates. Fourth, the RMSD and the RMSD-based P-value are calculated to score the matches. Fifth, the matches are ranked.
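The exact definition of the CMAD does not survive in this text, so the following Python sketch rests on an assumption: the contact matrix is taken to be the matrix of pairwise distances between the representative atoms of a site, and the CMAD is taken to be the mean absolute difference between corresponding entries of the query and candidate matrices. Under the same assumption, the MIDD used by other methods would be the maximum of those differences. The function names and the `cutoff` parameter are hypothetical.

```python
import math
from itertools import combinations

def contact_matrix(points):
    """Pairwise distances between representative atoms, keyed by index pair."""
    return {(i, j): math.dist(points[i], points[j])
            for i, j in combinations(range(len(points)), 2)}

def deviations(query, match):
    """Absolute differences between corresponding contact matrix entries."""
    cq, cm = contact_matrix(query), contact_matrix(match)
    return [abs(cq[pair] - cm[pair]) for pair in cq]

def cmad(query, match):
    """Assumed CMAD: mean absolute deviation of the two contact matrices."""
    dev = deviations(query, match)
    return sum(dev) / len(dev)

def midd(query, match):
    """Assumed MIDD: largest single inter-distance deviation."""
    return max(deviations(query, match))

def passes_filter(query, match, cutoff):
    """Keep a candidate only if its CMAD does not exceed a user-chosen cutoff."""
    return cmad(query, match) <= cutoff

query = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 4.0, 0.0)]
candidate = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 5.0, 0.0)]
```

Because an average is never larger than a maximum, a single displaced atom raises the MIDD more than it raises the CMAD, which is one way to read the motivation given in the introduction for moving away from the MIDD.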
Calculation of the RMSD
Here, R is the rotation matrix and R0 is the translation vector. We use the Nelder-Mead simplex method to solve this problem. This method uses the concept of a simplex, a special polytope of N + 1 vertices in N dimensions, and is commonly used in nonlinear optimization. The rotation matrix R is equal to Rx(α)·Ry(β)·Rz(γ)·Ry(β)ᵀ·Rx(α)ᵀ. Because CMASA superposes fewer than 10 amino acids, and because the geometric centres of two similar local structures should coincide after superposition, R0 can be pre-calculated as the geometric centre of the query minus the geometric centre of the match. The dimension of the simplex is therefore N = 3. The RMSD can then be calculated using the Nelder-Mead simplex minimization function in the GSL (GNU Scientific Library, a free numerical library for C and C++ programmers) or the "fminsearch" function in Matlab/Octave (software for numerical computation and engineering).
In the P-value calculation, EF is the expected number of matches with the given RMSD or better, RM is the RMSD, N is the total number of query residues, and Φ is the product of the percentage abundances of all residues. a and b are empirical constants: a = 473, b = 0.4.
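As a rough illustration of the objective that the simplex search minimizes, the Python sketch below builds R = Rx(α)·Ry(β)·Rz(γ)·Ry(β)ᵀ·Rx(α)ᵀ and evaluates the RMSD for a given triple of angles. It assumes that the rotation is applied to the match about its geometric centre, which is then moved onto the geometric centre of the query; the function names are hypothetical and the minimizer itself is not included. A function of this shape could be passed to any Nelder-Mead routine, for example the GSL or `fminsearch` functions mentioned above, or `scipy.optimize.minimize` with `method="Nelder-Mead"`.

```python
import math

def _matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def _transpose(a):
    return [[a[j][i] for j in range(3)] for i in range(3)]

def rot_x(t):
    c, s = math.cos(t), math.sin(t)
    return [[1, 0, 0], [0, c, -s], [0, s, c]]

def rot_y(t):
    c, s = math.cos(t), math.sin(t)
    return [[c, 0, s], [0, 1, 0], [-s, 0, c]]

def rot_z(t):
    c, s = math.cos(t), math.sin(t)
    return [[c, -s, 0], [s, c, 0], [0, 0, 1]]

def rotation(alpha, beta, gamma):
    """R = Rx(a) Ry(b) Rz(g) Ry(b)^T Rx(a)^T, as given in the text."""
    m = _matmul(rot_x(alpha), rot_y(beta))
    return _matmul(_matmul(m, rot_z(gamma)), _transpose(m))

def centroid(points):
    n = len(points)
    return [sum(p[k] for p in points) / n for k in range(3)]

def rmsd_for_angles(angles, query, match):
    """RMSD between query and the rotated, re-centred match (objective function)."""
    r = rotation(*angles)
    cq, cm = centroid(query), centroid(match)
    total = 0.0
    for q, m in zip(query, match):
        v = [m[k] - cm[k] for k in range(3)]           # centre the match atom
        moved = [sum(r[i][k] * v[k] for k in range(3)) + cq[i] for i in range(3)]
        total += sum((moved[i] - q[i]) ** 2 for i in range(3))
    return math.sqrt(total / len(query))

# A query and a copy rotated by -90 degrees about z and then shifted.
query = [(1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 3.0), (1.0, 1.0, 1.0)]
match = [(y + 5.0, -x - 2.0, z + 1.0) for x, y, z in query]
```

With only three free parameters, each evaluation of this objective is cheap, which fits the small templates (fewer than 10 residues) that the text describes.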
The nrPDB (non-redundant PDB, 18757 structures) was derived directly from the PDB (version released on 01-AUG-2008). All protein chains of at least 20 amino acids were clustered by blastclust (included in the BLAST package) at 90% sequence similarity. Each cluster was ranked by structure resolution, and the highest-ranked structure in each cluster was taken as the representative structure. The overlap between the nrPDB and the pdbEC (http://www.ebi.ac.uk/thornton-srv/databases/pdbsum/data/pdb_EC) is the non-redundant pdbEC (5189 structures).
The nrSCOP (non-redundant structures from SCOP) is derived from SCOP (version 1.75). The SCOP database has 7 levels: root, class, fold, superfamily, family, protein and species. At each species level, only the first structure was selected. All of these selected structures form the nrSCOP (14541 structures).
The nrCSA (non-redundant Catalytic Site Atlas) is derived from the Catalytic Site Atlas (CSA), version 2.2.9. CSA templates that contain only one or two residues are removed, because a single amino acid carries no information about the catalytic mechanism and two amino acids give too much noise in the CMASA results. Furthermore, CSA templates that have 3 residues of which 2 are glycines are also removed. These "4 atoms" CSA templates (1 Fa and 3 Ca atoms) are similar to two-amino-acid templates (also 4 atoms: 2 Fa and 2 Ca atoms), so they would also give too much noise in the CMASA results. The overlap between the nrPDB and the CSA is the nrCSA.
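The template filtering rule above can be written down directly. The Python sketch below is a hypothetical illustration: the function names and the use of three-letter residue codes as input are assumptions. It treats a 3-residue template with two or more glycines as removed, which slightly generalizes the "2 of them are glycines" wording.

```python
def atom_count(residues):
    """Number of points in a template: one Ca per residue, plus one Fa
    for every residue that is not glycine."""
    return len(residues) + sum(1 for r in residues if r.upper() != "GLY")

def keep_template(residues):
    """Apply the nrCSA filtering rule to a list of three-letter residue codes."""
    if len(residues) <= 2:
        return False          # one- or two-residue templates are too noisy
    glycines = sum(1 for r in residues if r.upper() == "GLY")
    if len(residues) == 3 and glycines >= 2:
        return False          # only "4 atoms", like a two-residue template
    return True
```

For example, `atom_count(["GLY", "GLY", "SER"])` and `atom_count(["HIS", "GLU"])` both give 4, which is the equivalence the text uses to justify removing the glycine-rich templates.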
Master templates and Mean conformational templates (MCT)
Here, RMSD(i,j) is the RMSD between the ith and jth templates in the group, and n is the number of templates in the group.
Here, (x', y', z') are the three-dimensional coordinates of the jth template after superposition onto the master template, and m is the number of templates with RMSD(master, j) ≤ 1.5 Å.
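The formulas for the master template and the MCT are not preserved in this text, so the sketch below is a guess at their shape based on the symbol definitions alone. It assumes that the master template is the group member with the smallest mean RMSD to the other members, and that the MCT is the coordinate-wise mean of all templates that superpose onto the master with RMSD ≤ 1.5 Å (the master itself included). The function names and input layouts are hypothetical.

```python
def pick_master(rmsd):
    """Index of the template with the smallest mean RMSD to the others.

    rmsd: symmetric n x n matrix (list of lists) with n >= 2.
    Assumption: this is the selection criterion for the master template.
    """
    n = len(rmsd)

    def mean_to_others(i):
        return sum(rmsd[i][j] for j in range(n) if j != i) / (n - 1)

    return min(range(n), key=mean_to_others)

def mean_conformation(superposed, rmsd_to_master, cutoff=1.5):
    """Average the coordinates of templates within `cutoff` of the master.

    superposed: list of templates, each a list of (x, y, z) tuples that have
    already been superposed onto the master template.
    rmsd_to_master: RMSD of each template to the master, in the same order.
    """
    kept = [t for t, r in zip(superposed, rmsd_to_master) if r <= cutoff]
    m = len(kept)
    n_atoms = len(kept[0])
    return [tuple(sum(t[a][k] for t in kept) / m for k in range(3))
            for a in range(n_atoms)]
```

Averaging only the templates close to the master keeps outlying conformations from distorting the mean structure.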
Sensitivity and specificity analysis
Two measures are used to evaluate the performance of CMASA: the ROC curve and the Matthews correlation coefficient (MCC). The ROC curve, used for comparing CMASA with other methods, plots the true positive (Tp) rate against the false positive (Fp) rate.
Where Tp, Tn, Fp and Fn are the true positives, true negatives, false positives and false negatives, respectively.
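For reference, the standard definition of the MCC in terms of these four counts is MCC = (Tp·Tn − Fp·Fn) / sqrt((Tp+Fp)(Tp+Fn)(Tn+Fp)(Tn+Fn)), and sensitivity is Tp / (Tp+Fn). A minimal Python sketch follows; the function names are hypothetical, and returning 0 for an empty denominator is a common convention rather than something stated in the article. The article's own definition of "accuracy" is not recoverable from this text, so it is deliberately left out.

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def sensitivity(tp, fn):
    """Fraction of true family members that are recovered."""
    return tp / (tp + fn) if (tp + fn) else 0.0
```

The MCC is well suited to this benchmark because the negatives (10582 structures) far outnumber the positives, and the coefficient uses all four counts rather than the positives alone.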
164 CSA families are used to evaluate the overall performance of CMASA. These families were generated by the following steps: 1) the nrCSA members with the same EC number are grouped together; 2) within each group, the members with the same active sites are grouped into a CSA family. Families with fewer than 3 members are discarded. As a result, we obtained 164 CSA families for analysing sensitivity and specificity (Supplement Table S1). The negative data set (10582 structures) is a subset of the nrPDB, deposited before 2008, that excludes the nrCSA and enzymes.
For each of the 164 CSA families, both the master template and the mean conformational template are generated and used to query a training set, which combines the family members (positives) with a constant negative data set (10582 structures). All hits of the 164 CSA families are combined and ranked by P-value or RMSD, giving 1033 positives (the sum of the positives of the 164 families) and 10582 negatives. The overall MCC, sensitivity and accuracy are then calculated (Figure 5A and 5B).
The overall optimal threshold is defined as the RMSD or P-value at which the overall MCC is at a maximum. Once the overall optimal threshold is defined, the hits of each family whose RMSD or P-value is smaller than this threshold are used to calculate the MCC, sensitivity and accuracy for that family (Additional file 2).
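The two grouping steps can be sketched as a single pass over the nrCSA entries. In this hypothetical Python example each entry is a tuple of an identifier, an EC number and a "site signature"; how the original work decided that two members have the same active sites is not stated here, so the signature (for instance a sorted tuple of residue types) is an assumption, as are the function name and the placeholder identifiers.

```python
from collections import defaultdict

def build_families(entries, min_members=3):
    """Group entries by (EC number, active-site signature).

    entries: iterable of (entry_id, ec_number, site_signature) tuples.
    Families with fewer than `min_members` members are discarded.
    """
    groups = defaultdict(list)
    for entry_id, ec_number, site_signature in entries:
        groups[(ec_number, site_signature)].append(entry_id)
    return {key: ids for key, ids in groups.items() if len(ids) >= min_members}

# Placeholder identifiers and EC labels, for illustration only.
entries = [
    ("a1", "ec-A", ("ASP", "HIS", "SER")),
    ("a2", "ec-A", ("ASP", "HIS", "SER")),
    ("a3", "ec-A", ("ASP", "HIS", "SER")),
    ("b1", "ec-A", ("CYS", "HIS")),
    ("c1", "ec-B", ("ASP", "HIS", "SER")),
]
```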
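The threshold search described above amounts to scanning the ranked hits and keeping the score at which the MCC peaks. The following Python sketch shows one way to do this; it is not the authors' code. It assumes that lower scores (RMSD or P-value) are better, that hits at or below the threshold are counted as predicted positives, and that positives or negatives missing from the hit list count as false negatives or true negatives respectively. All names are hypothetical.

```python
import math

def _mcc(tp, tn, fp, fn):
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def best_threshold(hits, n_pos, n_neg):
    """Return (threshold, mcc) maximizing the overall MCC.

    hits: list of (score, is_positive) pairs; lower score means a better hit.
    n_pos, n_neg: total numbers of positives and negatives in the training set.
    """
    ordered = sorted(hits, key=lambda h: h[0])
    best_score, best_value = None, -1.0
    tp = fp = 0
    for idx, (score, is_positive) in enumerate(ordered):
        if is_positive:
            tp += 1
        else:
            fp += 1
        # Evaluate only after all hits tied at this score have been counted.
        if idx + 1 < len(ordered) and ordered[idx + 1][0] == score:
            continue
        value = _mcc(tp, n_neg - fp, fp, n_pos - tp)
        if value > best_value:
            best_score, best_value = score, value
    return best_score, best_value

hits = [(0.1, True), (0.2, True), (0.3, False), (0.4, True), (0.9, False)]
```

In the benchmark described above, `n_pos` would be 1033 and `n_neg` would be 10582, and the resulting threshold would then be reused unchanged for the per-family statistics.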
This work was supported by the National Basic Research Program of China (Grant No. 2007CB815705; 2009CB941300), the National Natural Science Foundation of China (Grant No. 30623007) and Chinese Academy of Sciences (Grant No. 2007211311091).
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. Journal of molecular biology 1990, 215(3):403–410.
- Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic acids research 1997, 25(17):3389–3402. 10.1093/nar/25.17.3389
- Sigrist CJ, Cerutti L, Hulo N, Gattiker A, Falquet L, Pagni M, Bairoch A, Bucher P: PROSITE: a documented database using patterns and profiles as motif descriptors. Briefings in bioinformatics 2002, 3(3):265–274. 10.1093/bib/3.3.265
- Scheeff ED, Eswaran J, Bunkoczi G, Knapp S, Manning G: Structure of the pseudokinase VRK3 reveals a degraded catalytic site, a highly conserved kinase fold, and a putative regulatory binding site. Structure 2009, 17(1):128–138. 10.1016/j.str.2008.10.018
- Holm L, Sander C: Protein structure comparison by alignment of distance matrices. Journal of molecular biology 1993, 233(1):123–138. 10.1006/jmbi.1993.1489
- Madej T, Gibrat JF, Bryant SH: Threading a database of protein cores. Proteins 1995, 23(3):356–369. 10.1002/prot.340230309
- Krissinel E, Henrick K: Secondary-structure matching (SSM), a new tool for fast protein structure alignment in three dimensions. Acta crystallographica 2004, 60(Pt 12 Pt 1):2256–2268.
- Shindyalov IN, Bourne PE: Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein engineering 1998, 11(9):739–747. 10.1093/protein/11.9.739
- Chen P, Tsuge H, Almassy RJ, Gribskov CL, Katoh S, Vanderpool DL, Margosiak SA, Pinko C, Matthews DA, Kan CC: Structure of the human cytomegalovirus protease catalytic domain reveals a novel serine protease fold and catalytic triad. Cell 1996, 86(5):835–843. 10.1016/S0092-8674(00)80157-9
- Gerlt JA, Babbitt PC: Divergent evolution of enzymatic function: mechanistically diverse superfamilies and functionally distinct suprafamilies. Annual review of biochemistry 2001, 70: 209–246. 10.1146/annurev.biochem.70.1.209
- Fetrow JS, Skolnick J: Method for prediction of protein function from sequence using the sequence-to-structure-to-function paradigm with application to glutaredoxins/thioredoxins and T1 ribonucleases. Journal of molecular biology 1998, 281(5):949–968. 10.1006/jmbi.1998.1993
- Stark A, Russell RB: Annotation in three dimensions. PINTS: Patterns in Non-homologous Tertiary Structures. Nucleic acids research 2003, 31(13):3341–3344. 10.1093/nar/gkg506
- Kleywegt GJ: Recognition of spatial motifs in protein structures. Journal of molecular biology 1999, 285(4):1887–1897. 10.1006/jmbi.1998.2393
- Barker JA, Thornton JM: An algorithm for constraint-based structural template matching: application to 3D templates with statistical analysis. Bioinformatics (Oxford, England) 2003, 19(13):1644–1649. 10.1093/bioinformatics/btg226
- Ausiello G, Via A, Helmer-Citterich M: Query3d: a new method for high-throughput analysis of functional residues in protein structures. BMC bioinformatics 2005, 6(Suppl 4):S5. 10.1186/1471-2105-6-S4-S5
- Spriggs RV, Artymiuk PJ, Willett P: Searching for patterns of amino acids in 3D protein structures. Journal of chemical information and computer sciences 2003, 43(2):412–421.
- Schmitt S, Kuhn D, Klebe G: A new method to detect related function among proteins independent of sequence and fold homology. Journal of molecular biology 2002, 323(2):387–406. 10.1016/S0022-2836(02)00811-2
- Kinoshita K, Nakamura H: Identification of protein biochemical functions by similarity search using the molecular surface database eF-site. Protein Sci 2003, 12(8):1589–1595. 10.1110/ps.0368703
- Gherardini PF, Helmer-Citterich M: Structure-based function prediction: approaches and applications. Briefings in functional genomics & proteomics 2008, 7(4):291–302.
- Stark A, Sunyaev S, Russell RB: A model for statistical significance of local similarities in structure. Journal of molecular biology 2003, 326(5):1307–1316. 10.1016/S0022-2836(03)00045-7
- Lagarias JC, Reeds JA, Wright MH, Wright PE: Convergence Properties of the Nelder-Mead Simplex Method in Low Dimensions. SIAM Journal of Optimization 1998, 9(1):112–147. 10.1137/S1052623496303470
- Porter CT, Bartlett GJ, Thornton JM: The Catalytic Site Atlas: a resource of catalytic sites and residues identified in enzymes using structural data. Nucleic Acids Res 2004, (32 Database):D129–133. 10.1093/nar/gkh028
- Ago H, Oda M, Takahashi M, Tsuge H, Ochi S, Katunuma N, Miyano M, Sakurai J: Structural basis of the sphingomyelin phosphodiesterase activity in neutral sphingomyelinase from Bacillus cereus. J Biol Chem 2006, 281(23):16157–16167. 10.1074/jbc.M601089200
- Andreeva A, Howorth D, Brenner SE, Hubbard TJ, Chothia C, Murzin AG: SCOP database in 2004: refinements integrate structure and sequence family data. Nucleic Acids Res 2004, (32 Database):D226–229. 10.1093/nar/gkh039
- Laskowski RA: PDBsum new things. Nucleic acids research 2008.
- Fawcett T: An introduction to ROC analysis. Pattern Recognition Letters 2006, 27(8):861–874. 10.1016/j.patrec.2005.10.010
- Matthews BW: Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochimica et biophysica acta 1975, 405(2):442–451.
- Torrance JW, Bartlett GJ, Porter CT, Thornton JM: Using a library of structural templates to recognise catalytic sites and explore their evolution in homologous families. Journal of molecular biology 2005, 347(3):565–581. 10.1016/j.jmb.2005.01.044
- Finn RD, Tate J, Mistry J, Coggill PC, Sammut SJ, Hotz HR, Ceric G, Forslund K, Eddy SR, Sonnhammer EL, et al.: The Pfam protein families database. Nucleic Acids Res 2008, (36 Database):D281–288.
- Boltes I, Czapinska H, Kahnert A, von Bulow R, Dierks T, Schmidt B, von Figura K, Kertesz MA, Uson I: 1.3 Å structure of arylsulfatase from Pseudomonas aeruginosa establishes the catalytic mechanism of sulfate ester cleavage in the sulfatase family. Structure 2001, 9(6):483–491. 10.1016/S0969-2126(01)00609-8
- Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE: The Protein Data Bank. Nucleic acids research 2000, 28(1):235–242. 10.1093/nar/28.1.235
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.