Motivated Proteins: A web application for studying small three-dimensional protein motifs
© Leader and Milner-White; licensee BioMed Central Ltd. 2009
Received: 11 December 2008
Accepted: 11 February 2009
Published: 11 February 2009
Small loop-shaped motifs are common constituents of the three-dimensional structure of proteins. Typically they comprise between three and seven amino acid residues, and are defined by a combination of dihedral angles and hydrogen bonding partners. The most abundant of these are αβ-motifs, asx-motifs, asx-turns, β-bulges, β-bulge loops, β-turns, nests, niches, Schellmann loops, ST-motifs, ST-staples and ST-turns.
We have constructed a database of such motifs from a range of high-quality protein structures and built a web application as a visual interface to this.
The web application, Motivated Proteins, provides access to these 12 motifs (with 48 sub-categories) in a database of over 400 representative proteins. Queries can be made for specific categories or sub-categories of motif, motifs in the vicinity of ligands, motifs which include part of an enzyme active site, overlapping motifs, or motifs which include a particular amino acid sequence. Individual proteins can be specified, or, where appropriate, motifs for all proteins listed. The results of queries are presented in textual form as an (X)HTML table, and may be saved as parsable plain text or XML. Motifs can be viewed and manipulated either individually or in the context of the protein in the Jmol applet structural viewer. Cartoons of the motifs imposed on a linear representation of protein secondary structure are also provided. Summary information for the motifs is available, as are histograms of amino acid distribution, and graphs of dihedral angles at individual positions in the motifs.
Motivated Proteins is a publicly and freely accessible web application that enables protein scientists to study small three-dimensional motifs without requiring knowledge of either Structured Query Language or the underlying database schema.
Understanding of the diverse three-dimensional structures of proteins is aided by the recognition of their structural components. The best known of these are secondary structure elements, such as α-helix and β-sheet, and the super-secondary structures that can arise from them. Other, smaller, components are also abundant. The first such example was the β-turn, which exhibits geometrical constraints at certain of its four residues, and, like secondary structure, is stabilized by hydrogen bonding between peptide bond atoms – in this case a single hydrogen bond. β-turns are structural components in their own right, as they can be defined in terms of their dihedral angles and hydrogen bond, in the absence of any knowledge of the secondary structure. Recognition of other such abundant small hydrogen-bonded three-dimensional motifs in proteins followed, including the β-bulge, the β-bulge loop, the αβ-motif and the Schellmann loop [5, 6]. These motifs vary in length from three to seven residues, include one or more hydrogen bonds, and are generally associated with secondary-structure features.
Analogous structures to the β-turn occur (so-called side-chain/main-chain mimics) in which the hydrogen bond is between the main-chain NH group and the side-chain oxygen atom of aspartate or asparagine (asx-turns) or serine or threonine (ST-turns) [7, 8]. Other frequently occurring motifs involving side-chain hydrogen bonds were identified: asx-motifs, ST-motifs and ST-staples. There are also some abundant small motifs which involve the interaction of pairs of main-chain NH or CO groups – often by hydrogen bonding – with cationic or anionic groups, respectively. Examples of these are the nest and the niche.
We have constructed a relational database of these motifs that can be interrogated using Structured Query Language (SQL). To allow protein scientists who may be unfamiliar with SQL to access this database, we have built an associated web application, entitled 'Motivated Proteins'. This web application allows results to be visualized in a variety of ways, most importantly in the context of the three-dimensional structure of the protein. It is designed to facilitate specific queries from protein scientists whose focus is a particular protein or motif, but also lets protein scientists without such a focus explore this area of protein structure.
Construction and content
Choice and definition of motifs
The database currently includes the twelve motifs mentioned in the Background section, above, using the criterion for inclusion that at least 2% of the amino acid residues in proteins belong to a particular motif. These twelve categories are divided into a total of 48 sub-categories (Additional file 1) on the basis of certain features. These features include specific variations in length (e.g. Schellman loops can be six or seven residues in length), defining amino acid side-chain (e.g. S or T for ST-turns) and, in the specific case of β-bulges, whether the non-contiguous hydrogen-bonding partner is on the N-terminal side of the pair. In addition, different defining dihedral angles in the sub-categories may arise in two ways. The first is where there are alternative forms of a motif produced by peptide-plane flipping. The second is where there is an alternative enantiomeric form of the backbone of certain of the residues. (These give rise to the 'Flipped' and 'Reflected' attributes, respectively, in Additional file 1.)
Database design, implementation, and population
The starting point for generating the tables of protein data was a set of 500 PDB (Protein Data Bank) files prepared by the Richardson laboratory. The reason for using these files was that the coordinates are of high quality, they include hydrogen atoms and corrected side-chain amide atom positions, and have been edited so that in oligomers only one subunit is represented. Some further editing of these files was necessary: where alternative conformations are listed for individual residues, only the first was retained. Of the 500 proteins, 417 were included in the database, supplemented by twelve from the PDB chosen to broaden the coverage of protein folds (Additional file 4).
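The editing step that retains only the first alternative conformation can be sketched as follows. This is a minimal illustration rather than the authors' Perl code: the function name `keep_first_altloc` is hypothetical, and it assumes standard fixed-column PDB records in which column 17 holds the alternate-location indicator, column 22 the chain, columns 23-26 the residue number and column 27 the insertion code.

```python
def keep_first_altloc(pdb_lines):
    """Keep, per residue, only atoms with no altLoc flag or the first altLoc seen.

    Hypothetical helper: assumes fixed-column ATOM/HETATM records.
    """
    first_altloc = {}  # (chain, resSeq, iCode) -> first altLoc label encountered
    kept = []
    for line in pdb_lines:
        if not line.startswith(("ATOM", "HETATM")):
            kept.append(line)          # pass non-coordinate records through
            continue
        altloc = line[16]              # column 17 (0-based index 16)
        if altloc == " ":
            kept.append(line)          # atom has a single conformation
            continue
        residue = (line[21], line[22:26], line[26])
        chosen = first_altloc.setdefault(residue, altloc)
        if altloc == chosen:
            # Retain the atom and blank its altLoc flag.
            kept.append(line[:16] + " " + line[17:])
    return kept
```

Applied to the lines of a PDB file, the function returns the same records with the second and later conformations of each residue dropped.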
As shown in Fig. 1, the edited PDB files were the source of the coordinate data in the 'Atom' and 'HetAtom' entities. Processing the PDB files with the program HBPlus generated the hydrogen-bond data for the 'HydrogenBond' entity. Processing the PDB files with the program DSSP generated the φ and ψ dihedral angles and the secondary structure designations of the 'Residue' entity. Processing the PDB files with the program BBDEP generated the χ1 and χ2 angles of the 'Residue' entity.
Perl scripts were written to automate processing, but generation of some tables required manual intervention. Population of the table for the 'Ligand' entity (Fig. 1) required subjective assessment of the functional relevance of the entries in the 'HetAtom' table (corresponding to 'HETATM' lines in the PDB file). The data for the table describing the 'Protein' entity were prepared by hand to allow inclusion of EC (Enzyme Commission) numbers, and to allow consistency in nomenclature for indexing. Active site data for the table describing the 'Residue' entity were obtained by consulting the Catalytic Site Atlas at EBI, Cambridge, and added manually.
SQL queries were written to harvest each sub-category of motif from this initial database, and Perl scripts were written to automate the generation of the table describing the 'Motif' entity and the 'ResidueOfMotif' relationship entity from these queries (Fig. 1). Some of the type I β-turns obtained in this manner qualify as such only because they are parts of α-helices or 3₁₀-helices. These were removed from the database after running SQL queries to identify them. The 48 motif-defining SQL queries are provided as Additional file 5. The definitions of motif sub-categories are given in a Motif Glossary within the web application. Although a specific query is provided in the Motivated Proteins web application for searching for instances of any of these 48 sub-categories, simplified motif menus of the twelve main categories are employed in other queries (see Utility section, below).
The database is implemented in the MySQL relational database management system (version 5.0.41) and has been deployed on servers variously running the Solaris, Linux or Mac OS X operating systems. We refer to the database as the 'Protein Motif Database', a name which distinguishes it from applications we have written that provide access to it, including the Motivated Proteins web application which is described below.
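To give a flavour of how a motif can be harvested with SQL from hydrogen-bond and dihedral-angle tables, the sketch below finds type I β-turns and excludes those lying within helices. It is an illustration only. The table and column names (`residue`, `hbond`, `donor_seq`, `acceptor_seq`) are hypothetical and do not reproduce the schema of Fig. 1, the angle windows are textbook-style values chosen for the example rather than the definitions in Additional file 5, and SQLite stands in for MySQL so that the example is self-contained. The helix filter is applied in the same query here, whereas the procedure described above removed such turns in a later step.

```python
import sqlite3

# Hypothetical, much-simplified schema (not the schema of Fig. 1).
SCHEMA = """
CREATE TABLE residue (protein TEXT, seq INTEGER, phi REAL, psi REAL, ss TEXT);
CREATE TABLE hbond   (protein TEXT, donor_seq INTEGER, acceptor_seq INTEGER);
"""

# A beta-turn is taken as a CO(i)...HN(i+3) hydrogen bond; type I is selected
# by illustrative phi/psi windows on residues i+1 and i+2. Turns whose two
# central residues are both helical (DSSP 'H' or 'G') are excluded.
TYPE_I_TURN_QUERY = """
SELECT h.protein, h.acceptor_seq AS first_residue
FROM hbond h
JOIN residue r1 ON r1.protein = h.protein AND r1.seq = h.acceptor_seq + 1
JOIN residue r2 ON r2.protein = h.protein AND r2.seq = h.acceptor_seq + 2
WHERE h.donor_seq = h.acceptor_seq + 3
  AND r1.phi BETWEEN -90 AND -30  AND r1.psi BETWEEN -60 AND 0
  AND r2.phi BETWEEN -120 AND -60 AND r2.psi BETWEEN -30 AND 30
  AND NOT (r1.ss IN ('H', 'G') AND r2.ss IN ('H', 'G'))
ORDER BY h.protein, h.acceptor_seq
"""


def find_type_i_turns(conn):
    """Return (protein, first_residue) for each non-helical type I turn."""
    return conn.execute(TYPE_I_TURN_QUERY).fetchall()


def build_demo_db():
    """Create an in-memory database holding made-up example data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO residue VALUES (?, ?, ?, ?, ?)",
        [
            ("1abc", 2, -60.0, -30.0, "T"),   # isolated turn, residue i+1
            ("1abc", 3, -90.0, 0.0, "T"),     # isolated turn, residue i+2
            ("1abc", 11, -60.0, -30.0, "G"),  # same geometry inside a helix
            ("1abc", 12, -70.0, -20.0, "G"),
        ],
    )
    conn.executemany(
        "INSERT INTO hbond VALUES (?, ?, ?)",
        [("1abc", 4, 1), ("1abc", 13, 10)],
    )
    return conn
```

With the demonstration data, only the turn starting at residue 1 should be returned, because the turn starting at residue 10 has both central residues labelled as helix.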
Construction of the Motivated Proteins web application
We have used Java servlet technology to provide web access to the Protein Motif database, the servlet currently running in a Sun Java System Web Server (version 7.0) on the same machine as the Protein Motif database. It generates the main XHTML query pages (level 1.0 Strict) with which the user interacts (Utility section).
In servlet-based web applications, new pages are generated in a linear manner as a result of essentially form-based queries. In Motivated Proteins some such queries populate menus on resulting pages from which the user makes choices to formulate a scientific query which, in turn, returns a page containing the results. This latter is furnished with a form from which further queries can be made. Where alternative views of data – or supplementary information – are invoked by the user, they have been taken out of this linear query stream as small 'pop-up' web pages, generated by CGI applications written in Perl.
Population of alphabetical protein indexes was done dynamically using AJAX and a separate Java servlet.
Three of this first group of queries involve specifying a protein and selecting a specific category of motif (Fig. 2b and 2c). Depending on the query type, one obtains all instances of the chosen motif in the protein, those within 4 Å of a ligand, or those which include enzyme active-site residues. The fourth query allows searches for the occurrence of a short amino acid sequence string within a specific motif or all motifs. In the latter case the number of 'hits' displayed can be restricted.
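The ligand-proximity criterion can be illustrated with a short distance test. This is a sketch under an assumption not stated in the text: a motif is counted as near a ligand if any motif atom lies within the cutoff of any ligand atom. The function name `motif_near_ligand` is hypothetical, and coordinates are plain (x, y, z) tuples in Å.

```python
import math


def motif_near_ligand(motif_atoms, ligand_atoms, cutoff=4.0):
    """Return True if any motif atom is within `cutoff` angstroms of any ligand atom.

    Assumed criterion (any atom pair); coordinates are (x, y, z) tuples.
    """
    return any(
        math.dist(m, l) <= cutoff
        for m in motif_atoms
        for l in ligand_atoms
    )
```

For the sizes involved (a motif of a few residues against the heteroatoms of one protein) this all-pairs comparison is adequate; a spatial index would only be worth considering for much larger atom sets.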
The first query in the second group allows searches for overlaps between two types of motif. In this case there are two drop-down menus of motifs, the second with an 'All' option, which the user is advised to employ in an initial screening step because the large number of possible combinations will generally include many that are not represented in any individual protein. The second query allows the user to retrieve instances of any of the 48 sub-categories of motif. This is useful for systematic work, especially when one wants to locate examples of less abundant motifs. (One can make a preliminary summary query – below – to determine the abundance of the different sub-categories.)
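The overlap query can be thought of as an interval-intersection test on residue ranges. The sketch below assumes that two motifs overlap when they share at least one residue; the `Motif` record and the function `overlapping_pairs` are hypothetical names, and passing `None` as the second category plays the role of the 'All' option.

```python
from typing import NamedTuple, Optional


class Motif(NamedTuple):
    category: str
    start: int  # first residue number of the motif
    end: int    # last residue number of the motif (inclusive)


def overlapping_pairs(motifs, cat_a: str, cat_b: Optional[str] = None):
    """Return (a, b) pairs where a is of cat_a, b is of cat_b (or any
    category if cat_b is None), and the two share at least one residue."""
    pairs = []
    for i, a in enumerate(motifs):
        if a.category != cat_a:
            continue
        for j, b in enumerate(motifs):
            if i == j:
                continue  # a motif does not overlap with itself
            if cat_b is not None and b.category != cat_b:
                continue
            if a.start <= b.end and b.start <= a.end:
                pairs.append((a, b))
    return pairs
```

Running the function first with `cat_b=None` mirrors the recommended initial screening step, since it shows which partner categories actually occur before a specific combination is requested.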
Presentation of the results of non-summary queries
The initial results of the queries described above are presented as tables in a page which the user can print or save (Fig. 2d). The XHTML-compliance of the pages allows them to be parsed as XML. However, as the XHTML tends to be rather extensive, links to other machine-readable textual options are provided (Fig. 2d). One of these is an easily-parsable plain-text format and the other is custom XML for which a DTD (Document Type Definition) has been created.
For the queries which find individual or overlapping motifs in a protein, there is the option (labelled '2D' – Fig. 2d) of viewing the motifs in the context of a graphic of the primary structure of the protein showing a simple cartoon representation of the secondary structure (Fig. 3c). An original Perl script (SecondGlance) is used to generate these graphics, which are based on those of the Wirplot diagram.
Three different types of summary query are provided. The 'General Summary' section provides access to tables listing the number of each sub-category of motif, and the number of each category of motif in the proximity of a ligand or including an active site residue. These tables are populated dynamically from the database. For overlapping motifs, the main overlapping partners for each motif are listed in a text page. The 'AA Frequencies' section provides histograms for the occurrence of the amino acids at each position of a motif sub-category. The 'Dihedral Angles' section allows the user to generate and view φ/ψ plots for each position in each sub-category of a selected motif (Fig. 3d).
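The data behind an amino acid frequency histogram amount to a count of residue types at each position across all instances of a motif sub-category. A minimal sketch, assuming that each instance is supplied as a one-letter sequence string of equal length (the function name `position_frequencies` is hypothetical):

```python
from collections import Counter


def position_frequencies(sequences):
    """Count amino acids at each position over equal-length motif sequences.

    Returns one Counter per position, e.g. result[0]["D"] is the number of
    instances with aspartate at the first position.
    """
    if not sequences:
        return []
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("all motif sequences must have the same length")
    # zip(*sequences) yields one tuple of residues per motif position.
    return [Counter(column) for column in zip(*sequences)]
```

Each returned counter can be passed directly to a plotting routine to draw the histogram for that position.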
A point of particular concern in designing Motivated Proteins was to avoid placing users in situations that might lead them to abandon the web application unnecessarily. This is discussed in the context of two general ways in which we envisage the resource being used.
The first way in which we envisage the resource being used is by a protein scientist with a focus on a particular protein or motif. A potential problem here is that a protein of interest might not be present in the database, given that it was constrained in size for the reasons described in the Construction and Content section, above. The application is designed so that under such circumstances an external call is made to the CATH facility at University College, London to retrieve the CATH structural classification code of the query protein. A search is then made of the local database for the proteins with CATH codes closest to this, and these are presented as options to the user. As the Protein Motif Database provides coverage of the first two levels (Class and Architecture) of the CATH classification there is a good chance that a structurally related protein will be found. In the event that the query protein has not received a CATH classification, an external call is made to the PDBSum SearchHeaders.pl facility at EBI, Cambridge, and functionally related alternatives are offered.
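One plausible reading of "closest" CATH code is the code that shares the largest number of leading hierarchy levels with the query. The sketch below implements that reading and is an assumption, not a description of the deployed lookup; the function names, PDB identifiers and CATH codes in the usage note are illustrative only.

```python
def shared_levels(code_a, code_b):
    """Number of leading CATH levels (Class, Architecture, ...) in common."""
    count = 0
    for x, y in zip(code_a.split("."), code_b.split(".")):
        if x != y:
            break
        count += 1
    return count


def closest_proteins(query_code, local_codes):
    """Find local proteins whose CATH code is closest to `query_code`.

    `local_codes` maps PDB identifier -> CATH code. Returns a tuple
    (levels_shared, sorted list of identifiers); the list is empty when no
    local protein shares even the Class level.
    """
    scores = {
        pdb_id: shared_levels(query_code, code)
        for pdb_id, code in local_codes.items()
    }
    best = max(scores.values(), default=0)
    if best == 0:
        return 0, []
    return best, sorted(p for p, s in scores.items() if s == best)
```

For example, with a local table `{"1abc": "3.40.50.300", "2xyz": "3.40.30.10"}` and a query code of `3.40.190.10`, both proteins match at the Class and Architecture levels and would be offered as alternatives.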
The second way in which we envisage the resource being used is by a protein scientist who wishes to explore these structural motifs, without having a particular protein or motif in mind. A potential problem here is the need to specify the PDB identifier of a protein example. For this reason an alphabetic index of the names of proteins in the database is provided (Fig. 2b). Selecting a letter of the alphabet invokes a floating list of corresponding protein names and PDB identifiers, and clicking on one of the latter enters it in the search field (Fig. 2c). (This index is context-sensitive – if one is searching for motifs near a ligand, for example, only those proteins with a ligand are included.) Alternatively, a keyword search can be performed to find proteins in the database answering a specific description (Fig. 2b–d).
In both cases considered above it can happen that the user makes a query for a motif, only to find that there are no instances of that motif in the protein selected. For this reason the pull-down menu for selecting motifs has an option, 'All' (Fig. 2c), which, on running, returns a listing of the number of motifs of each category in the specified protein. This listing can then form the basis for fruitful queries on the protein. If the focus is a particular motif, the user is able to employ the 'Specific Motifs' menu option.
We believe that the public availability of Motivated Proteins will assist scientific research on small hydrogen-bonded three-dimensional motifs within proteins, and hope that it may also lead to a greater appreciation of the occurrence and potential importance of such motifs.
Availability and requirements
The URI of the Motivated Proteins site is http://motif.gla.ac.uk/, with direct access to the web application at http://motif.gla.ac.uk/motif/index.html. The web application is publicly and freely accessible, requiring no registration and with no restrictions on use. All server scripts and Java source code supporting the web application are available, on request, under the GNU General Public License.
Several M. Res. Bioinformatics students made technical contributions to this work during 15-week summer projects: Fang Chen implemented the initial database and servlet, Simon Harding wrote the initial Perl scripts to automate population of the database with protein data, Christopher Tindal wrote the Perl scripts to harvest motifs from the database, Suraj Menon was responsible for the initial three-dimensional interface, Meng-Pin Weng wrote the initial index and keyword search functions, Shrikant Sharma implemented the index in AJAX, and Chintan Vora implemented the CATH-lookup features.
We acknowledge the use of the CATH-lookup facility at University College, London, and thank Ian Sillitoe for his help in this respect. We also acknowledge the use of the PDBSum SearchHeaders.pl facility at EBI Cambridge, and thank Roman Laskowski for his cooperation in establishing it. We thank the Jmol development team for advice, and especially Bob Hanson for introducing new features that we requested specifically for the Motivated Proteins web application.
DPL thanks the Faculty of Biomedical and Life Sciences for providing computing facilities, Ian Walker of the University of Glasgow Computing Service for invaluable server support, and Sung-Hee Park and Gary Gray for advice. We also thank Pawel Herzyk and Adrian Lapthorne for suggesting features to include in the web application.
1. Venkatachalam CM: Stereochemical criteria for polypeptides and proteins. V. Conformation of a system of 3 linked peptide units. Biopolymers 1968, 6: 1425–1436. 10.1002/bip.1968.360061006
2. Richardson JS, Getzoff ED, Richardson DC: The beta bulge: a common small unit of nonrepetitive protein structure. Proc Natl Acad Sci USA 1978, 75(6):2574–2578. 10.1073/pnas.75.6.2574
3. Milner-White EJ: Beta-bulges within loops as recurring features of protein structure. Biochim Biophys Acta 1987, 911(2):261–265.
4. Baker EN, Hubbard RE: Hydrogen bonding in globular proteins. Prog Biophys Mol Biol 1984, 44(2):97–179. 10.1016/0079-6107(84)90007-5
5. Schellmann JA: The alphaL conformation at the ends of alpha-helix. In Protein Folding. Edited by: Jaenicke R. Amsterdam: Elsevier; 1980:53–61.
6. Milner-White EJ: Recurring loop motif in proteins that occurs in right-handed and left-handed forms. Its relationship with alpha-helices and beta-bulge loops. J Mol Biol 1988, 199(3):503–511. 10.1016/0022-2836(88)90621-3
7. Eswar N, Ramakrishnan C: Secondary structures without backbone: an analysis of backbone mimicry by polar side chains in protein structures. Protein Eng 1999, 12(6):447–455. 10.1093/protein/12.6.447
8. Duddy WJ, Nissink JWM, Allen FH, Milner-White EJ: Mimicry by asx- and ST-turns of the four main types of beta-turn in proteins. Protein Sci 2004, 13: 3051–3055. 10.1110/ps.04920904
9. Wan W-Y, Milner-White EJ: A Natural Grouping of Motifs with an Aspartate or Asparagine Residue Forming Two Hydrogen Bonds to Residues Ahead in Sequence: Their Occurrence at α-Helical N Termini and in Other Situations. J Mol Biol 1999, 286: 1633–1649. 10.1006/jmbi.1999.2552
10. Wan W-Y, Milner-White EJ: A Recurring Two-Hydrogen-bond Motif Incorporating A Serine or Threonine Residue is found both at α-Helical N Termini and in Other Situations. J Mol Biol 1999, 286: 1651–1662. 10.1006/jmbi.1999.2551
11. Ballesteros JA, Deupi X, Olivella M, Haaksma EE, Pardo L: Serine and threonine residues bend alpha-helices in the chi(1) = g(-) conformation. Biophys J 2000, 79(5):2754–2760. 10.1016/S0006-3495(00)76514-3
12. Watson JD, Milner-White EJ: A Novel Main-chain Anion-binding Site in Proteins: The Nest. A Particular Combination of Φ,Ψ Values in Successive Residues Gives Rise to Anion-binding Sites That Occur Commonly And Are Found Often at Functionally Important Regions. J Mol Biol 2002, 315: 171–182. 10.1006/jmbi.2001.5227
13. Torrance GM, Leader DP, Gilbert DR, Milner-White EJ: A novel main chain motif in proteins bridged by cationic groups: the niche. J Mol Biol 2009, 385(4):1076–1086. 10.1016/j.jmb.2008.11.007
14. Hayward S: Peptide-plane flipping in proteins. Protein Sci 2001, 10(11):2219–2227. 10.1110/ps.23101
15. Golovin A, Oldfield TJ, Tate JG, Velankar S, Barton GJ, Boutselakis H, Dimitropoulos D, Fillon J, Hussain A, Ionides JM, et al.: E-MSD: an integrated data resource for bioinformatics. Nucleic Acids Res 2004, (32 Database):D211–216. 10.1093/nar/gkh078
16. Prlic A, Down TA, Hubbard TJ: Adding Some SPICE to DAS. Bioinformatics 2005, 21(Suppl 2):ii40-ii41. 10.1093/bioinformatics/bti1106
17. Lovell SC, Davis IW, Arendall WB 3rd, de Bakker PI, Word JM, Prisant MG, Richardson JS, Richardson DC: Structure validation by Cα geometry: ϕ, ψ and Cβ deviation. Proteins 2003, 50(3):437–450. 10.1002/prot.10286
18. McDonald IK, Thornton JM: Satisfying hydrogen bonding potential in proteins. J Mol Biol 1994, 238(5):777–793. 10.1006/jmbi.1994.1334
19. Kabsch W, Sander C: Dictionary of Protein Secondary Structure: Pattern Recognition of Hydrogen-Bonded and Geometrical Features. Biopolymers 1983, 22: 2577–2637. 10.1002/bip.360221211
20. Dunbrack RL Jr, Karplus M: Backbone-dependent rotamer library for proteins. Application to side-chain prediction. J Mol Biol 1993, 230(2):543–574. 10.1006/jmbi.1993.1170
21. Porter CT, Bartlett GJ, Thornton JM: The Catalytic Site Atlas: a resource of catalytic sites and residues identified in enzymes using structural data. Nucleic Acids Res 2004, (32 Database):D129–133. 10.1093/nar/gkh028
22. Herráez A: Biomolecules in the computer: Jmol to the rescue. Biochem Mol Biol Educ 2006, 34(4):255–261. 10.1002/bmb.2006.494034042644
23. Lincoln D. Stein: GD [http://search.cpan.org/dist/GD/]
24. Laskowski RA, Chistyakov VV, Thornton JM: PDBsum more: new summaries and analyses of the known 3D structures of proteins and nucleic acids. Nucleic Acids Res 2005, (33 Database):D266–268.
25. Orengo CA, Michie AD, Jones S, Jones DT, Swindells MB, Thornton JM: CATH – a hierarchic classification of protein domain structures. Structure 1997, 5(8):1093–1108. 10.1016/S0969-2126(97)00260-8
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.