PASS2: an automated database of protein alignments organised as structural superfamilies
© Bhaduri et al; licensee BioMed Central Ltd. 2004
Received: 28 November 2003
Accepted: 02 April 2004
Published: 02 April 2004
The functional selection and three-dimensional structural constraints of proteins in nature often relate to the retention of significant sequence similarity between proteins of similar fold and function despite poor sequence identity. Organization of structure-based sequence alignments for distantly related proteins provides a map of the conserved and critical regions of the protein universe that is useful for the analysis of folding principles, for the evolutionary unification of protein families and for maximizing the information return from experimental structure determination. The Protein Alignment organised as Structural Superfamily (PASS2) database represents continuously updated structural alignments for evolutionarily related, sequentially distant proteins.
An automated and updated version of PASS2 is in direct correspondence with SCOP 1.63 and consists of sequences having identity below 40% among themselves. Protein domains have been grouped into 628 multi-member superfamilies and 566 single-member superfamilies. Structure-based sequence alignments for the superfamilies have been obtained using COMPARER, while initial equivalences have been derived from a preliminary superposition using LSQMAN or STAMP 4.0. The final sequence alignments have been annotated for structural features using JOY 4.0. The database is supplemented with sequence relatives belonging to different genomes, conserved spatially interacting and structural motifs, probabilistic hidden Markov models of superfamilies based on the alignments and useful links to other databases. Probabilistic models and sensitive position-specific profiles obtained from reliable superfamily alignments aid annotation of remote homologues and are useful tools in structural and functional genomics. PASS2 presents the phylogeny of its members based on both sequence and structural dissimilarities. Clustering of members allows us to understand diversification of the family members. The search engine has been improved for simpler browsing of the database.
The database resolves alignments among structural domains consisting of evolutionarily diverged sets of sequences. The availability of reliable sequence alignments of distantly related proteins, despite poor sequence identity, and of single-member superfamilies permits better sampling of structures in libraries for fold recognition of new sequences and for the understanding of protein structure-function relationships of individual superfamilies. PASS2 is accessible at http://www.ncbs.res.in/~faculty/mini/campass/pass2.html
Classification of proteins into families is performed on the basis of the similarity of sequences to the family members [1, 2]. Importantly, however, detectable global sequence similarity in a protein family is not required for retention of the three-dimensional fold, and only a very small number of conserved functional residues are required for biochemical activity amongst proteins belonging to a superfamily. Establishing evolutionary relationships between superfamily members that have similar structure and function but have diverged in sequence is challenging. Over 49,000 domains deposited in the Protein Data Bank (PDB) are organized in different databases by hierarchical classification schemes or in terms of structural neighbourhood distances [5–7]. SCOP (1.63 release) records 49,497 protein domains, grouped into merely 765 folds, suggesting a strong structural convergence of proteins. Homologous families can be easily grouped by simple sequence searches, whereas superfamily members, adopting the same fold and performing similar biological roles [8–13], can often be identified only by sensitive fold prediction algorithms followed by a careful alignment of sequences.
The availability of reliable sequence alignments for distantly related proteins, despite poor sequence identity, permits better sampling of structures in libraries for fold recognition of new sequences and for the understanding of protein structure-function relationships of individual superfamilies. In addition, the construction of three-dimensional models using homology modelling techniques is usually reliable where the sequence identity between the query and the structural homologues (templates) is 30% or above. Analyses of structural and sequence differences amongst known superfamily members may provide useful guidelines for modelling distantly related proteins. The PASS2 database [14, 15] presents alignments of sequentially distant proteins related at the superfamily level. We report an automated, updated version of the superfamily alignment database that is in direct correspondence with the SCOP (1.63) database.
Construction and content
The present version of PASS2 considers domains as assigned in SCOP 1.63. Domains within a superfamily that are no more than 40% identical with each other have been considered for curating the database. The 40% cut-off in percentage sequence identity, compared with the 25% identity level used in the previous version of PASS2, was chosen to reduce the number of single-member superfamilies. The resulting 4,001 protein domains, assigned to 1,194 superfamilies spanning the seven classes of proteins, were chosen for structure-based sequence alignment.
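The 40% identity criterion can be illustrated with a minimal sketch. The greedy selection order, the definition of identity over aligned non-gap columns, and the function names below are assumptions made for illustration; the text does not specify how the PASS2 pipeline computes identity or picks representatives.

```python
from typing import Dict, List


def percent_identity(a: str, b: str) -> float:
    """Percent identity over columns where neither aligned sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for x, y in pairs if x == y)
    return 100.0 * matches / len(pairs)


def select_representatives(aligned: Dict[str, str], cutoff: float = 40.0) -> List[str]:
    """Greedy filter (an assumed strategy): keep a domain only if it is
    no more than `cutoff` percent identical to every domain already kept."""
    kept: List[str] = []
    for name, seq in aligned.items():  # dicts preserve insertion order
        if all(percent_identity(seq, aligned[k]) <= cutoff for k in kept):
            kept.append(name)
    return kept


if __name__ == "__main__":
    # Toy aligned sequences with hypothetical domain names.
    toy = {
        "dom1": "ACDEFGHIKL",
        "dom2": "ACDEFGHIKV",  # 90% identical to dom1, so it is dropped
        "dom3": "MNPQ-GHRST",  # about 22% identical to dom1, so it is kept
    }
    print(select_representatives(toy))
```

A greedy pass depends on input order, so a production pipeline would likely fix that order by some quality criterion such as resolution; that choice is outside what the text describes.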
Curation of alignments
Utility and discussion
Assigning new structural entries to pre-existing superfamilies
Improved methods of protein engineering, crystallography and NMR spectroscopy have led to a surge of new three-dimensional protein structures deposited in the Protein Data Bank. PASS2 allows classification of three-dimensional domains into their respective superfamilies based on sequence and structural properties. The sequence of an uploaded structure is compared to the hidden Markov models of PASS2 and assigned to candidate superfamilies on the basis of a liberal expectation value (E = 1.0). Representative structures of the putative superfamilies are then superposed with the query using LSQMAN, thus associating the query with a particular superfamily. Alternatively, the user can superpose an uploaded structure onto specific superfamilies.
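The two-stage assignment described above (a liberal E-value filter followed by structural superposition) might be sketched as follows. The function names, the dictionary inputs and the use of "less than or equal to" at the threshold are assumptions; the E-values and RMSD values would come from the HMM search and from LSQMAN respectively, neither of which is invoked here.

```python
from typing import Dict, List, Optional

E_VALUE_CUTOFF = 1.0  # liberal expectation value quoted in the text


def candidate_superfamilies(hmm_hits: Dict[str, float],
                            cutoff: float = E_VALUE_CUTOFF) -> List[str]:
    """Superfamilies whose HMM matched the query at or below the cutoff, best first."""
    passing = [(evalue, sf) for sf, evalue in hmm_hits.items() if evalue <= cutoff]
    return [sf for _, sf in sorted(passing)]


def assign_superfamily(hmm_hits: Dict[str, float],
                       rmsd_by_superfamily: Dict[str, float],
                       cutoff: float = E_VALUE_CUTOFF) -> Optional[str]:
    """Among the sequence-based candidates, pick the superfamily whose
    representative superposes onto the query with the lowest RMSD."""
    candidates = [sf for sf in candidate_superfamilies(hmm_hits, cutoff)
                  if sf in rmsd_by_superfamily]
    if not candidates:
        return None
    return min(candidates, key=lambda sf: rmsd_by_superfamily[sf])


if __name__ == "__main__":
    # Hypothetical superfamily labels, E-values and RMSDs (in angstroms).
    hits = {"sf_A": 0.5, "sf_B": 3.0, "sf_C": 1e-5}
    rmsds = {"sf_A": 1.8, "sf_B": 0.9, "sf_C": 2.6}
    print(candidate_superfamilies(hits))
    print(assign_superfamily(hits, rmsds))
```

Note that sf_B is never considered in the toy data even though it superposes best, because it fails the sequence-based filter first.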
Predicting superfamilies and alignment for sequences
Links have been provided to popular sequence search methods such as PSI-BLAST and PHI-BLAST, which may be employed to associate unannotated sequences with PASS2 superfamilies. Hmmpfam, a method that matches a sequence against probabilistic profiles, can also be used for similar assignments. Alignments of a query sequence with superfamily members can be obtained using MALIGN. Three-dimensional features can also be attributed to the sequence alignment using JOY.
Hidden Markov models for PASS2
Comparison of the number of hits obtained in HMMSearch using models derived from regular multiple sequence alignments and from structure-based sequence alignments.
Columns: superfamily; hits obtained from PASS2 HMMs; hits obtained from superfamily HMMs.
Superfamilies compared: anticodon-binding domain of class I aminoacyl-tRNA synthetases; cyclophilin (peptidylprolyl isomerase).
Superfamily members in the genome database
PASS2 has several new features to associate the structure-based sequences with their homologues in various genome databases. Sequence homologues of the superfamilies have been searched for in the non-redundant sequence database using PSI-BLAST and Hmmsearch. For the PSI-BLAST searches, each individual member of each superfamily was queried against the non-redundant sequence database. The expectation value was set to 0.001, with 20 iterations. Hidden Markov models for every superfamily were built using structural alignments (as explained above). These models were searched against the non-redundant database to enrich the sequence members using the Hmmsearch program of the HMM suite, applying an E-value threshold of 0.1. A third approach has been to employ interacting motifs, identified for superfamilies, as constraints in PHI-BLAST searches against the non-redundant database using an E-value of 1.0, as explained elsewhere. Hits obtained by the three approaches that belong to the genomes were aligned using CLUSTALW and are presented along with the structural representatives of the superfamilies. The top 10 hits displayed on the web are aligned with PASS2 members. The entire set of hits corresponding to genomes can also be downloaded.
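As a rough illustration of how hits from the three searches could be pooled, the sketch below applies the per-method E-value thresholds quoted above and keeps the best-scoring record per sequence. The `Hit` record, the method labels, the treatment of 0.001 as a reporting cutoff, and the ranking of the "top 10" by E-value are all assumptions; E-values from different programs are not strictly comparable, and the text does not say how the displayed hits are ranked.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List

# Per-method E-value thresholds taken from the text.
THRESHOLDS: Dict[str, float] = {"psiblast": 0.001, "hmmsearch": 0.1, "phiblast": 1.0}


@dataclass(frozen=True)
class Hit:
    seq_id: str   # identifier of the database sequence (hypothetical format)
    method: str   # one of the keys of THRESHOLDS
    evalue: float


def pool_hits(hits: Iterable[Hit],
              thresholds: Dict[str, float] = THRESHOLDS) -> Dict[str, Hit]:
    """Keep hits passing their method's threshold; one best hit per sequence."""
    best: Dict[str, Hit] = {}
    for hit in hits:
        if hit.evalue > thresholds[hit.method]:
            continue  # fails the cutoff for the method that found it
        current = best.get(hit.seq_id)
        if current is None or hit.evalue < current.evalue:
            best[hit.seq_id] = hit
    return best


def top_hits(pooled: Dict[str, Hit], n: int = 10) -> List[Hit]:
    """The n pooled hits with the lowest E-values (ties broken by identifier)."""
    return sorted(pooled.values(), key=lambda h: (h.evalue, h.seq_id))[:n]


if __name__ == "__main__":
    raw = [
        Hit("seq1", "psiblast", 1e-5),
        Hit("seq1", "hmmsearch", 0.05),
        Hit("seq2", "psiblast", 0.01),   # above 0.001, rejected
        Hit("seq3", "phiblast", 0.5),
        Hit("seq4", "hmmsearch", 0.2),   # above 0.1, rejected
    ]
    for hit in top_hits(pool_hits(raw)):
        print(hit)
```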
Information about superfamily members
A structure-based sequence alignment of the query with the appropriate superfamily can be obtained. Superposed coordinates for the query with the best-ranking superfamily (based on the RMSD value) are also provided. Motifs represent invariant regions of the superfamily and are helpful in protein design, engineering and folding studies. Spatially conserved interacting motifs are identified for each superfamily as described elsewhere and are listed in the current version of the database along with pseudoenergies for their spatial interactions (Bhaduri et al., in press). Corresponding links to the structural motifs of superfamilies (SMoS) database can also be accessed.
Phylogenetic analysis aids in the understanding of the diversity among the members. Diversification of structural members may be studied in terms of the dissimilarity of structures or the divergence of sequences. The database has been linked to other useful protein databases, as in the previous version of PASS2.
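For readers unfamiliar with the RMSD value used for ranking, the following minimal sketch computes it for two sets of already superposed, equivalent atoms. It assumes the superposition and the atom equivalences have been determined beforehand (in PASS2 by LSQMAN); the function name and coordinate representation are illustrative only.

```python
import math
from typing import Sequence, Tuple

Coord = Tuple[float, float, float]


def rmsd(a: Sequence[Coord], b: Sequence[Coord]) -> float:
    """Root-mean-square deviation between equivalent, already superposed atoms."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("need two non-empty coordinate sets of equal length")
    total = 0.0
    for (x1, y1, z1), (x2, y2, z2) in zip(a, b):
        total += (x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2
    return math.sqrt(total / len(a))


if __name__ == "__main__":
    # Two toy atom sets displaced by exactly 1 angstrom along z.
    query = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    representative = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
    print(rmsd(query, representative))
```

RMSD values are only comparable between superfamilies when computed over similar numbers of equivalent atoms, which is one reason a real ranking would usually report the number of matched residues alongside the RMSD.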
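A tree can be built from either kind of dissimilarity once it is expressed as a pairwise distance matrix. The sketch below uses UPGMA (average-linkage clustering) purely as an example; the text does not state which tree-building method PASS2 uses, and the member names and distances are hypothetical.

```python
from typing import Dict, FrozenSet, List, Tuple, Union

Tree = Union[str, Tuple["Tree", "Tree"]]


def upgma(names: List[str], dist: Dict[FrozenSet[str], float]) -> Tree:
    """Build a nested-tuple tree by repeatedly merging the closest clusters.

    `dist` maps frozenset({name_i, name_j}) to a dissimilarity, which may be
    sequence-based (e.g. 1 - fractional identity) or structure-based.
    """
    if not names:
        raise ValueError("at least one member is required")
    clusters = {i: (name, 1) for i, name in enumerate(names)}  # id -> (tree, size)
    d: Dict[Tuple[int, int], float] = {}
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            d[(i, j)] = dist[frozenset((names[i], names[j]))]
    next_id = len(names)
    while len(clusters) > 1:
        i, j = min(d, key=lambda pair: (d[pair], pair))  # closest pair of clusters
        (tree_i, n_i), (tree_j, n_j) = clusters.pop(i), clusters.pop(j)
        merged: Dict[Tuple[int, int], float] = {}
        for k in clusters:
            d_ik = d[(min(i, k), max(i, k))]
            d_jk = d[(min(j, k), max(j, k))]
            # Size-weighted average distance to the new cluster.
            merged[(k, next_id)] = (n_i * d_ik + n_j * d_jk) / (n_i + n_j)
        d = {pair: v for pair, v in d.items() if i not in pair and j not in pair}
        d.update(merged)
        clusters[next_id] = ((tree_i, tree_j), n_i + n_j)
        next_id += 1
    (tree, _size), = clusters.values()
    return tree


if __name__ == "__main__":
    members = ["A", "B", "C"]
    distances = {
        frozenset(("A", "B")): 0.2,
        frozenset(("A", "C")): 0.8,
        frozenset(("B", "C")): 0.6,
    }
    print(upgma(members, distances))
```

Running the same routine on a sequence-derived and a structure-derived matrix for one superfamily gives two trees that can be compared to see where sequence and structural divergence disagree.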
PASS2 and its applications
PASS2 is a compendium of structure-based sequence alignments of distantly related proteins grouped at the superfamily level in direct correspondence with SCOP definitions. Furthermore, PASS2 acts as a 'junction' point from which representative superfamily members are linked to genome, sequence and structural databases. Phylogenies of superfamily members provide a crude but quantitative estimate of evolutionary relationships among the members. Motifs describe the invariant regions of proteins and act as descriptors for the superfamily. The HMMs can be useful in identifying more members. Availability of such alignment databases over the World Wide Web facilitates the study and design of experiments on specific superfamilies. They also enable systematic survey and analysis of various structural properties for performing fold predictions. The database may be accessed and downloaded across the World Wide Web.
Associating different proteins with structurally similar and evolutionarily related proteins enhances our functional understanding of protein superfamilies. The multiple alignments of distantly related representatives are particularly informative and often reveal a signature of invariantly conserved residues. Access to sequence alignments of distantly related proteins over the World Wide Web offers the possibility to study and design experiments on specific superfamilies. Such alignments also permit systematic survey and analysis of various structural properties and support fold prediction.
Availability of PASS2 database
PASS2 is accessible at http://www.ncbs.res.in/~faculty/mini/campass/pass2.html
R.S. is a Senior Research Fellow funded by the Wellcome Trust, U.K. GP is supported by the Wellcome Trust. Financial and infrastructural support from NCBS (TIFR) is also acknowledged.
- Rossmann MG, Moras D, Olsen KW: Chemical and biological evolution of nucleotide-binding protein. Nature 1974, 250: 194–199.
- Lesk AM, Chothia C: How different amino acid sequences determine similar protein structures: the structure and evolutionary dynamics of the globins. J Mol Biol 1980, 136: 225–270.
- Reddy BV, Li WW, Shindyalov IN, Bourne PE: Conserved key amino acid positions (CKAAPs) derived from the analysis of common substructures in proteins. Proteins 2001, 42: 148–163. 10.1002/1097-0134(20010201)42:2<148::AID-PROT20>3.0.CO;2-R
- Bernstein FC, Koetzle TF, Williams GJ, Meyer Jr EF, Brice MD, Rodgers JR, Kennard O, Shimanouchi T, Tasumi M: The Protein Data Bank: a computer-based archival file for macromolecular structures. J Mol Biol 1977, 112: 535–542.
- Holm L, Sander C: The FSSP database of structurally aligned protein fold families. Nucleic Acids Res 1994, 22: 3600–3609.
- Murzin AG, Brenner SE, Hubbard T, Chothia C: SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 1995, 247: 536–540. 10.1006/jmbi.1995.0159
- Orengo CA, Michie AD, Jones S, Jones DT, Swindells MB, Thornton JM: CATH-a hierarchic classification of protein domain structures. Structure 1997, 5: 1093–1108.
- Blundell TL, Bedarkar S, Rinderknecht E, Humbel RE: Insulin-like growth factor 1. a model for tertiary structure accounting for immunoreactivity and receptor binding. Proc Natl Acad Sci (USA) 1978, 75: 180–184.
- Chothia C: Principles that determine the structures of proteins. Ann Rev Biochem 1984, 53: 537–572. 10.1146/annurev.bi.53.070184.002541
- Murthy MRN: A fast method of comparing protein structure. FEBS Letts 1984, 168: 97–102. 10.1016/0014-5793(84)80214-8
- Holm L, Ouzounis C, Sander C, Tuparev G, Vriend G: A database of protein-structure families with common folding motifs. Protein Sci 1992, 1: 1691–1698.
- Russell RB, Barton GJ: Structural features can be unconserved in proteins with similar folds. An analysis of side-chain to side-chain contacts, secondary structure and accessibility. J Mol Biol 1994, 244: 332–350. 10.1006/jmbi.1994.1733
- Orengo CA, Jones DT, Thornton JM: Protein superfamilies and domain superfolds. Nature 1994, 372: 631–634. 10.1038/372631a0
- Sowdhamini R, Burke DF, Huang JF, Mizuguchi K, Nagarajaram HA, Srinivasan N, Steward RE, Blundell TL: CAMPASS: a database of structurally aligned protein superfamilies. Structure 1998, 6: 1087–1094.
- Mallika V, Bhaduri A, Sowdhamini R: PASS2: a semi-automated database of Protein Alignments Organised as Structural Superfamilies. Nucleic Acids Res 2002, 30: 284–288. 10.1093/nar/30.1.284
- Kleywegt GJ, Jones TA: A super position. CCP4/ESF-EACBM Newsletter on Protein Crystallography 1994, 31: 9–14.
- Russell RB, Barton GJ: Multiple protein sequence alignment from tertiary structure comparison: assignment of global and residue confidence levels. Proteins 1992, 14: 309–323.
- Mizuguchi K, Deane CM, Blundell TL, Johnson MS, Overington JP: JOY: protein sequence-structure representation and analysis. Bioinformatics 1998, 14: 617–623. 10.1093/bioinformatics/14.7.617
- Sali A, Blundell TL: Definition of general topology equivalence in protein structures-a procedure involving comparison of properties and relationships through simulated annealing and dynamic programming. J Mol Biol 1990, 212: 403–428.
- Sutcliffe MJ, Haneef I, Carney D, Blundell TL: Knowledge based modelling of homologous proteins, Part I: Three-dimensional frameworks derived from the simultaneous superposition of multiple structures. Protein Eng 1987, 1: 377–384.
- Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25: 3389–3402. 10.1093/nar/25.17.3389
- Zhang Z, Schaffer AA, Miller W, Madden TL, Lipman DJ, Koonin EV, Altschul SF: Protein sequence similarity searches using patterns as seeds. Nucleic Acids Res 1998, 26: 3986–3990. 10.1093/nar/26.17.3986
- Eddy SR: Profile hidden Markov models. Bioinformatics 1998, 14: 755–763. 10.1093/bioinformatics/14.9.755
- Johnson MS, Overington JP, Blundell TL: Alignment and searching for common protein folds using a data bank of structural templates. J Mol Biol 1993, 231: 735–752. 10.1006/jmbi.1993.1323
- Park J, Karplus K, Barrett C, Hughey R, Haussler D, Hubbard T, Chothia C: Sequence comparisons using multiple sequences detect three times as many remote homologues as pairwise methods. J Mol Biol 1998, 284: 1201–1210. 10.1006/jmbi.1998.2221
- Krogh A, Brown M, Mian IS, Sjolander K, Haussler D: Hidden Markov models in computational biology: applications to protein modeling. J Mol Biol 1994, 235: 1501–1531. 10.1006/jmbi.1994.1104
- Hughey R, Krogh A: Hidden Markov models for sequence analysis: extension and analysis of the basic method. CABIOS 1996, 12: 95–107.
- Gough J, Karplus K, Hughey R, Chothia C: Assignment of homology to genome sequences using a library of hidden Markov models that represent all proteins of known structure. J Mol Biol 2001, 313: 903–919. 10.1006/jmbi.2001.5080
- Bhaduri A, Ravishankar R, Sowdhamini R: Conserved spatially interacting motifs of protein superfamilies: Application to fold recognition and function annotation of genome data. Proteins: Structure, Function and Bioinformatics 2004, 54: 657–670. 10.1002/prot.10638
- Chenna R, Sugawara H, Koike T, Lopez R, Gibson TJ, Higgins DG, Thompson JD: Multiple sequence alignment with the Clustal series of programs. Nucleic Acids Res 2003, 31: 3497–3500. 10.1093/nar/gkg500
- Chakrabarti S, Venkatramanan K, Sowdhamini R: SMoS: a database of structural motifs of superfamilies. Prot Engng 2003, 16: 791–793. 10.1093/protein/gzg110
- Kraulis PJ: MOLSCRIPT: a program to produce both detailed and schematic plots of protein structures. J Appl Cryst 1991, 24: 946–50. 10.1107/S0021889891004399
This article is published under license to BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.