PASS2: an automated database of protein alignments organised as structural superfamilies

Background: The functional selection and three-dimensional structural constraints of proteins in nature often relate to the retention of significant sequence similarity between proteins of similar fold and function despite poor sequence identity. Organization of structure-based sequence alignments for distantly related proteins provides a map of the conserved and critical regions of the protein universe that is useful for the analysis of folding principles, for the evolutionary unification of protein families and for maximizing the information return from experimental structure determination. The Protein Alignment organised as Structural Superfamily (PASS2) database represents continuously updated structural alignments for evolutionarily related, sequentially distant proteins.

Description: An automated and updated version of PASS2 is in direct correspondence with SCOP 1.63 and consists of sequences having identity below 40% among themselves. Protein domains have been grouped into 628 multi-member superfamilies and 566 single-member superfamilies. Structure-based sequence alignments for the superfamilies have been obtained using COMPARER, while initial equivalencies have been derived from a preliminary superposition using LSQMAN or STAMP 4.0. The final sequence alignments have been annotated for structural features using JOY4.0. The database is supplemented with sequence relatives belonging to different genomes, conserved spatially interacting and structural motifs, probabilistic hidden Markov models of superfamilies based on the alignments, and useful links to other databases. Probabilistic models and sensitive position-specific profiles obtained from reliable superfamily alignments aid annotation of remote homologues and are useful tools in structural and functional genomics. PASS2 presents the phylogeny of its members based on both sequence and structural dissimilarities. Clustering of members allows us to understand the diversification of the family members. The search engine has been improved for simpler browsing of the database.

Conclusions: The database resolves alignments among the structural domains consisting of an evolutionarily diverged set of sequences. The availability of reliable sequence alignments of distantly related proteins despite poor sequence identity, together with single-member superfamilies, permits better sampling of structures in libraries for fold recognition of new sequences and for the understanding of protein structure-function relationships of individual superfamilies. PASS2 is accessible at


Background
Classification of proteins into families is performed on the basis of the similarity of sequences to the family members [1,2]. Importantly, however, detectable global sequence similarity in a protein family is not required for retention of the three-dimensional fold, and only a very small number of conserved functional residues are required for biochemical activity amongst proteins belonging to a superfamily [3]. Establishing evolutionary relationships between superfamily members that have similar structure and function but have diverged in sequence is challenging. Over 49,000 domains deposited in the Protein Data Bank (PDB) [4] are organized in different databases by hierarchical classification schemes or in terms of structural neighbourhood distances [5-7]. SCOP (1.63 release) records 49,497 protein domains, grouped into merely 765 folds, suggesting a strong structural convergence of proteins. Homologous families can be easily grouped by simple sequence searches, whereas superfamily members, which adopt the same fold and perform similar biological roles [8-13], can often be identified by sensitive fold prediction algorithms followed by a careful alignment of sequences.
Availability of reliable sequence alignments for distantly related proteins despite poor sequence identity permits better sampling of structures in libraries for fold recognition of new sequences and for the understanding of protein structure-function relationships of individual superfamilies. In addition, the construction of three-dimensional models using homology modelling techniques is usually reliable where the sequence identity between the query and the structural homologues (templates) is 30% or above. Analyses of structural and sequence differences amongst known superfamily members may provide useful guidelines for modelling distantly related proteins. The PASS2 database [14,15] presents alignments of sequentially distant proteins related at the superfamily level. We report an automated, updated version of the superfamily alignment database that is in direct correspondence with the SCOP (1.63) database.

Construction and content
The present version of PASS2 considers domains as assigned in SCOP 1.63 [6]. Only domains within a superfamily that are no more than 40% identical to each other have been considered for curating the database. The 40% cut-off in sequence identity, compared with the 25% identity level used in the previous version of PASS2, was chosen to reduce the number of single-member superfamilies. The resulting 4,001 protein domains were assigned to 1,194 superfamilies spanning the seven classes of proteins and were chosen for structure-based sequence alignment.
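The 40% identity filter described above can be illustrated with a minimal sketch. The paper does not state how the non-redundant set was selected, so the greedy procedure below, the function names `select_nonredundant` and `lookup_identity`, and the pairwise identity values are all assumptions made purely for illustration.

```python
from typing import Callable, List

def select_nonredundant(domains: List[str],
                        identity: Callable[[str, str], float],
                        cutoff: float = 40.0) -> List[str]:
    """Greedy filter (an assumed procedure, not the authors' documented method).

    A domain is kept only if it is no more than `cutoff` percent identical
    to every domain already kept.
    """
    kept: List[str] = []
    for dom in domains:
        if all(identity(dom, other) <= cutoff for other in kept):
            kept.append(dom)
    return kept

# Hypothetical pairwise percent identities for three domains of one superfamily.
_PAIRWISE = {
    frozenset(("d1", "d2")): 85.0,
    frozenset(("d1", "d3")): 22.0,
    frozenset(("d2", "d3")): 25.0,
}

def lookup_identity(a: str, b: str) -> float:
    """Return the precomputed (hypothetical) percent identity of a pair."""
    return _PAIRWISE[frozenset((a, b))]

representatives = select_nonredundant(["d1", "d2", "d3"], lookup_identity)
print(representatives)
```

In this toy case `d2` is dropped because it is 85% identical to `d1`, while `d3` is retained. A greedy pass depends on input order, so a real pipeline would need a documented ordering rule (for example, by structure quality).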

Curation of alignments
Structural domains, obtained from SCOP [6] definitions, have been grouped at the superfamily level and superposed by rigid-body superposition (Figure 1). An initial superposition of all the structural domains belonging to each non-redundant superfamily was performed using LSQMAN [16] or STAMP 4.0 [17]. LSQMAN [16] was used for superposing two-member superfamilies, while STAMP 4.0 [17] was used for multi-member superfamilies. From the coarse alignment, equivalent regions were identified using JOY [18]. COMPARER [19] was employed to derive a refined alignment and superposition for the structures. Superposition was achieved by the choice of 'initial equivalencies' that served as seeds for pairwise rigid-body superposition using PMNFC, a modified form of MNYFIT [20] (Figure 2). The final alignment was presented using the three-dimensional structural features of JOY [18] (Figure 3).
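The order of the curation steps can be summarised in a short sketch. It only encodes the tool choice stated above; the function names are hypothetical, "multi-member" is read here as three or more members, and returning no steps for a single-member superfamily is an assumption, since the text does not describe how those are processed.

```python
from typing import List, Optional

def choose_superposition_tool(n_members: int) -> Optional[str]:
    """Pick the initial superposition tool from the superfamily size."""
    if n_members < 1:
        raise ValueError("a superfamily needs at least one member")
    if n_members == 1:
        return None          # nothing to superpose (assumption)
    if n_members == 2:
        return "LSQMAN"      # two-member superfamilies
    return "STAMP 4.0"       # three or more members

def curation_steps(n_members: int) -> List[str]:
    """List the alignment steps, in order, for one superfamily."""
    tool = choose_superposition_tool(n_members)
    if tool is None:
        return []
    return [
        tool + " (initial superposition)",
        "JOY (equivalent regions)",
        "COMPARER (refined alignment and superposition)",
        "JOY (structural annotation of final alignment)",
    ]

for size in (1, 2, 6):
    print(size, curation_steps(size))
```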

Assigning new structural entries to pre-existing superfamilies
Improved methods of protein engineering, crystallography and NMR spectroscopy have led to a surge of new three-dimensional protein structures deposited in the Protein Data Bank. PASS2 allows classification of three-dimensional domains into their respective superfamilies based on sequence and structural properties. The sequence of the uploaded structure is compared with the hidden Markov models of PASS2 and assigned to candidate superfamilies on the basis of a liberal expectation value threshold (E = 1.0). Representative structures of the putative superfamilies are then superposed with the query using LSQMAN [16], thus associating the query with a particular superfamily. Alternatively, the user can superpose an uploaded structure on specific superfamilies.
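The two-stage assignment described above (a permissive HMM screen followed by structural superposition) might be sketched as follows. This is not the PASS2 server code: the function names, the superfamily labels and all numeric values are hypothetical, and treating the E = 1.0 threshold as inclusive and ranking candidates by lowest RMSD are assumptions.

```python
from typing import Dict, List, Optional

def candidate_superfamilies(hmm_evalues: Dict[str, float],
                            max_evalue: float = 1.0) -> List[str]:
    """Stage 1: keep superfamilies whose HMM match passes the liberal E-value."""
    passing = [sf for sf, e in hmm_evalues.items() if e <= max_evalue]
    return sorted(passing, key=lambda sf: hmm_evalues[sf])

def assign_superfamily(hmm_evalues: Dict[str, float],
                       rmsd_to_representative: Dict[str, float],
                       max_evalue: float = 1.0) -> Optional[str]:
    """Stage 2: among the candidates, pick the one with the lowest RMSD."""
    scored = [(rmsd_to_representative[sf], sf)
              for sf in candidate_superfamilies(hmm_evalues, max_evalue)
              if sf in rmsd_to_representative]
    return min(scored)[1] if scored else None

# Hypothetical scores for one uploaded structure.
evalues = {"SF_A": 0.002, "SF_B": 0.8, "SF_C": 5.0}
rmsds = {"SF_A": 3.4, "SF_B": 1.9}   # Angstrom, from superposition

print(candidate_superfamilies(evalues))
print(assign_superfamily(evalues, rmsds))
```

Here `SF_C` is rejected by the sequence screen, and `SF_B` is chosen over `SF_A` because its representative superposes more closely on the query, even though its E-value is weaker.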

Predicting superfamilies and alignment for sequences
Links have been provided to popular sequence search methods such as PSI-BLAST [21] and PHI-BLAST [22], which may be employed to associate unannotated sequences with PASS2 superfamilies. Hmmpfam [23], which matches a sequence against probabilistic profiles, can also be used for similar assignments. Sequence alignments of a query sequence with superfamily members can be obtained using MALIGN [24]. Three-dimensional features can also be attributed to the sequence alignment using JOY [18].

Hidden markov models for PASS2
In searches for sequence homologues and in sequence assignment, profile-based methods perform better than those that use pairwise comparisons [25]. Family profiles based on hidden Markov models (HMMs) are popular probabilistic models applied to sequence annotation and searches [26,27]. The structure-based sequence alignments of the respective superfamilies in PASS2 provide a reliable basis for building hidden Markov models. We provide HMMs, built using the HMM suite [23], for superfamily alignments corresponding to the latest version of PASS2. The performance of these HMMs has been compared with that of models built using their structural homologues present in the PDB [28]. Searches for homologues have been performed on the non-redundant sequence database using both sets of models. Higher coverage has been obtained (Table 1) for superfamilies using PASS2 HMMs, suggesting their value in sensitive sequence searches. Hidden Markov models for both the structure-based sequence alignments and the sequence-enriched superfamily alignments can be downloaded from the World Wide Web.

Superfamily members in the genome database
PASS2 has several new features to associate the structure-based sequences with their homologues in various genome databases. Sequence homologues of the superfamilies have been searched for in the non-redundant sequence database using PSI-BLAST [21] and Hmmsearch [23]. For the PSI-BLAST searches, each individual member of a superfamily was queried against the non-redundant sequence database. The E-value threshold was set to 0.001, with 20 iterations. Hidden Markov models for every superfamily were built using the structural alignments (as explained above). These models were searched against the non-redundant database to enrich the sequence members, using the Hmmsearch program of the HMM suite with an E-value threshold of 0.1. A third approach has been to employ interacting motifs, identified for superfamilies, as constraints in PHI-BLAST searches against the non-redundant database, using an E-value of 1.0 as explained elsewhere [29]. Hits obtained by the three approaches that belong to the genomes were aligned using CLUSTALW [30] and are presented along with the structural representatives of the superfamilies. The top 10 hits displayed on the web are aligned with PASS2 members. The entire set of hits corresponding to genomes can also be downloaded.

Figure 1. Flowchart representation of the steps involved in the curation of PASS2 database. Listed are useful tools and additional derived information that may be obtained from PASS2.
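The three search strategies and their E-value thresholds can be combined as in the sketch below. The thresholds are those quoted in the text; everything else is an assumption for illustration, including the tuple layout of a hit, the sequence identifiers, and the choice to rank the merged hits by their best E-value (E-values from different programs are not strictly comparable, so this ranking is only a simplification).

```python
from typing import Dict, Iterable, List, Tuple

# E-value thresholds as stated in the text for each search method.
THRESHOLDS: Dict[str, float] = {
    "psiblast": 0.001,
    "hmmsearch": 0.1,
    "phiblast": 1.0,
}

Hit = Tuple[str, str, float]  # (method, sequence id, E-value); assumed layout

def merge_hits(hits: Iterable[Hit], top_n: int = 10) -> List[Tuple[str, float]]:
    """Filter hits by method-specific threshold, merge, and return the top N."""
    best: Dict[str, float] = {}
    for method, seq_id, evalue in hits:
        if evalue > THRESHOLDS[method]:
            continue  # fails the threshold for its own method
        if seq_id not in best or evalue < best[seq_id]:
            best[seq_id] = evalue
    ranked = sorted(best.items(), key=lambda item: (item[1], item[0]))
    return ranked[:top_n]

# Hypothetical hits for one superfamily.
example_hits = [
    ("psiblast", "seq1", 1e-5),
    ("psiblast", "seq2", 0.01),   # rejected: above 0.001
    ("hmmsearch", "seq2", 0.05),  # accepted by the HMM search
    ("phiblast", "seq3", 0.5),
    ("hmmsearch", "seq4", 0.5),   # rejected: above 0.1
]
top_hits = merge_hits(example_hits)
print(top_hits)
```

The retained sequences would then be passed to a multiple alignment program such as CLUSTALW together with the structural representatives.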

Information about superfamily members
A structure-based sequence alignment of the query with the appropriate superfamily can be obtained. Superposed coordinates of the query with the best-ranking superfamily (based on the RMSD value) are also provided. Motifs represent invariant regions of the superfamily and are helpful in protein design, engineering and folding studies. Spatially conserved interacting motifs are identified as described elsewhere [29] for each superfamily and are listed in the current version of the database along with pseudoenergies for their spatial interactions (Bhaduri et al., in press). Corresponding links to the structural motifs of superfamily (SMoS) database [31] can also be accessed.
Phylogenetic analysis aids in the understanding of the diversity among the members. Diversification of structural members may be studied in terms of the dissimilarity of structure or divergence of the sequences. The database has been linked to other useful protein databases as in the previous version of PASS2 [15].
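A sequence-based dissimilarity matrix of the kind that underlies such clustering can be computed from a multiple alignment, as in this minimal sketch. The paper does not specify the distance measure used in PASS2; the p-distance below (fraction of mismatches over positions where neither sequence has a gap) is an assumed, commonly used choice, and the aligned sequences are invented.

```python
from typing import Dict, List

def p_distance(a: str, b: str) -> float:
    """Fraction of mismatched residues over ungapped aligned positions."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = mismatches = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue  # skip positions with a gap in either sequence
        compared += 1
        if x != y:
            mismatches += 1
    if compared == 0:
        raise ValueError("no ungapped positions to compare")
    return mismatches / compared

def distance_matrix(alignment: Dict[str, str]) -> List[List[float]]:
    """Symmetric matrix of pairwise p-distances, in dictionary order."""
    seqs = list(alignment.values())
    return [[p_distance(s, t) for t in seqs] for s in seqs]

# Invented three-member alignment.
toy_alignment = {
    "memberA": "ACDE-G",
    "memberB": "ACEEFG",
    "memberC": "ACDEFG",
}
matrix = distance_matrix(toy_alignment)
for name, row in zip(toy_alignment, matrix):
    print(name, row)
```

Such a matrix can be passed to any standard tree-building or clustering method; a structural dissimilarity measure (for example, one derived from RMSD) could replace `p_distance` without changing the rest of the sketch.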

PASS2 and its applications
PASS2 is a compendium of structure-based sequence alignments of distantly related proteins grouped at the superfamily level in direct correspondence with SCOP definitions. Furthermore, PASS2 acts as a 'junction' point for obtaining links from representative superfamily members to genome, sequence and structural databases. Phylogenies of superfamily members provide a crude but quantitative estimate of evolutionary relationships among the members. Motifs describe the invariant regions of proteins and act as descriptors for the superfamily. HMMs can be useful in identifying more members. The availability of such alignment databases over the World Wide Web facilitates the study and design of experiments on specific superfamilies. They also enable systematic survey and analysis of various structural properties for performing fold predictions. The database may be accessed and downloaded across the World Wide Web.

Conclusions
Associating different proteins with structurally similar and evolutionarily related proteins enhances our functional understanding of protein superfamilies. The multiple alignments of distantly related representatives are particularly informative and often reveal a signature of invariantly conserved residues. Access to sequence alignments of distantly related proteins over the World Wide Web offers the possibility to study and design experiments on specific superfamilies. These alignments also permit systematic survey and analysis of various structural properties, as well as fold prediction.
Figure 3. Representative structure-based sequence alignment for the cytochrome superfamily. The six members have been aligned and represented incorporating the three-dimensional features of JOY [18].
BMC Bioinformatics 2004, 5 http://www.biomedcentral.com/1471-2105/5/35