Protein function annotation with Structurally Aligned Local Sites of Activity (SALSAs)
© Wang et al.; licensee BioMed Central Ltd. 2013
Published: 28 February 2013
The prediction of biochemical function from the 3D structure of a protein has proved to be much more difficult than was originally foreseen. A reliable method to test the likelihood of putative annotations and to predict function from structure would add tremendous value to structural genomics data. We report on a new method, Structurally Aligned Local Sites of Activity (SALSA), for the prediction of biochemical function based on a local structural match at the predicted catalytic or binding site.
Implementation of the SALSA method is described. For the structural genomics protein PY01515 (PDB ID 2aqw) from Plasmodium yoelii, it is shown that the putative annotation, orotidine 5'-monophosphate decarboxylase (OMPDC), is most likely correct. SALSA analysis of YP_001304206.1 (PDB ID 3h3l), a putative sugar hydrolase from Parabacteroides distasonis, shows that its active site does not bear close resemblance to any previously characterized member of its superfamily, the Concanavalin A-like lectins/glucanases. Three residues in the active site of the thermophilic beta-1,4-xylanase from Nonomuraea flexuosa (PDB ID 1m4w), Y78, E87, and E176, overlap with POOL-predicted residues of similar type, Y168, D153, and E232, in YP_001304206.1. The substrate recognition regions of the two proteins are rather different, suggesting that YP_001304206.1 is a new functional type within the superfamily. A structural genomics protein from Mycobacterium avium (PDB ID 3q1t) has been reported to be an enoyl-CoA hydratase (ECH), but SALSA analysis shows a poor match between the predicted residues for the SG protein and those of known ECHs. A better local structural match is obtained with Anabaena beta-diketone hydrolase (ABDH) from Cyanobacterium anabaena (PDB ID 2j5s). This suggests that the reported ECH function of the SG protein is incorrect and that it is more likely a β-diketone hydrolase.
A local site match provides a more compelling function prediction than that obtainable from a simple 3D structure match. The present method can confirm putative annotations, identify misannotation, and in some cases suggest a more probable annotation.
There are currently over 11,000 structural genomics (SG) protein structures in the Protein Data Bank (PDB), and most of them are of unknown or uncertain function, as the inference of function from structure has proved to be more difficult than anticipated. Furthermore, when new structures of unknown function are determined, it is common practice to make a tentative functional assignment from the closest sequence match or the best 3D structure match to an annotated protein. Such tentative functional assignments are often incorrect. Moreover, one annotation error can propagate or "percolate" [2–4] in databases as additional proteins are annotated by automated or semi-automated means.
Overviews of current methods for the functional annotation of proteins from their sequence and/or structure have been given in recent reviews [5–8]. The simplest and most commonly employed methods seek the closest sequence matches using a search program such as BLAST, or alternatively the closest 3D structure match obtained from, e.g., Dali, Combinatorial Extension (CE), or Topofit, and then simply transfer the function from the closest match to the query protein. However, even relatively high sequence similarity does not necessarily imply similar function. Other types of sequence-based methods employ motif searching, phylogenetic profiling, or genome context. The Critical Assessment of Function Annotation (CAFA) experiment (http://biofunctionprediction.org/) seeks to assess the current state of the art of function prediction, chiefly from sequence. The aim of this work is to exploit structural information, together with computed chemical properties, to enhance function prediction capabilities.
It was hoped that SG would provide functional annotations for the protein products of newly-sequenced coding genes, as indeed the 3D structure can sometimes be indicative of function. Simple protein fold comparison does work in some cases, as domains having a common fold sometimes do have the same function. However, many folds have multiple functions. For instance, the Rossmann fold and the TIM barrel each represent more than 50 different functions. The use of local 3D structural motifs or templates, a feature of the present method, is now emerging as a more promising path for correct functional annotation from structure [14–19].
In spite of recent advances in protein function prediction, inference of biochemical function from the structure is difficult [20, 21]. Hundreds of SG structures have no functional assignment at all and, for thousands of others, the functional hypotheses are putative and uncertain. Not all such hypotheses will prove in time to be correct, as examples below will illustrate. The ability to determine function from the 3D structure would add great value to this growing volume of SG data.
A different approach to functional annotation from 3D structure is presented here and is based on the combination of functional site prediction with local 3D structural alignment. Functional site predictions are obtained from Partial Order Optimum Likelihood (POOL) [22, 23], a monotonicity-constrained maximum likelihood method, using computed chemical, electrostatic, and geometric properties, as well as phylogenetic information (if available), as input features. POOL places all of the residues in the input protein structure into an ordered list, ranked according to probability of participation in the active site. The top-ranked residues constitute the active site prediction. Structural alignments are obtained for sets of these local sites. Characteristic spatial patterns of predicted residues at the structurally aligned local sites of activity (SALSAs) are then used to identify specific types of biochemical function. The quality of the match of the predicted functional site in the query protein to functional sites in proteins of known function is measured using a scoring function. The present method can determine whether a putative functional assignment is likely to be correct or incorrect. In some cases where a protein is shown to be misannotated, a probable functional assignment is made.
Functional residue predictions were made using POOL [22, 23]. Input features for each residue in a given structure include: electrostatics information, as contained in THEMATICS metrics [24, 25]; phylogenetic information from INTREPID [26, 27]; and geometric information from ConCavity (structure-only version). The top-ranked residues in the POOL output constitute the functional site prediction. Cut-off limits are specified for each case.
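The cut-off step can be illustrated with a minimal Python sketch. It assumes the ranking is available as a mapping from residue label to score; the function name, the residue labels, the scores, and the 30% cut-off are all hypothetical and do not reflect POOL's actual output format.

```python
def predicted_site(pool_scores, top_percent):
    """Rank residues by score (highest first) and keep the top `top_percent` percent."""
    if not 0 < top_percent <= 100:
        raise ValueError("top_percent must be in (0, 100]")
    ranked = sorted(pool_scores, key=pool_scores.get, reverse=True)
    # Integer ceiling division, so a fractional residue count rounds up.
    n_keep = -(-len(ranked) * top_percent // 100)
    return ranked[:n_keep]


# Hypothetical scores for a ten-residue toy protein (not real POOL output).
toy_scores = {
    "D71": 0.91, "K93": 0.88, "D96": 0.80, "H120": 0.35, "S127": 0.30,
    "T45": 0.12, "G12": 0.05, "A33": 0.04, "L50": 0.02, "V8": 0.01,
}
site = predicted_site(toy_scores, top_percent=30)
print(site)
```

Because the cut-off is specified case by case, it is kept as an explicit argument rather than a fixed constant.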
Multiple structure alignments are made for each set of proteins. The structural alignment of multiple structures of diverse function can be difficult and therefore multiple alignment methods [11, 12, 29] may be needed for some cases. In the examples shown here, T-Coffee is used. For present purposes, a full alignment is not necessary. A quality alignment is only required in the local spatial region of the predicted active site.
SALSA tables are constructed for the locally aligned residues in the predicted active site. In a SALSA table, the rows represent individual protein structures and the columns represent spatially aligned positions.
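One possible in-memory layout for such a table is sketched below in Python. The input format, the function name, and the cell conventions (parentheses for a residue that is present but not predicted, "-" for an empty position) are assumptions made for illustration; the protein names and residues are invented.

```python
def build_salsa_table(aligned, predicted):
    """Build a table with one row per protein and one column per aligned position.

    aligned:   {protein: [residue label, or None for a gap, per aligned position]}
    predicted: {protein: set of residue labels predicted for that protein}
    """
    table = {}
    for protein, residues in aligned.items():
        row = []
        for res in residues:
            if res is None:
                row.append("-")         # no residue at this aligned position
            elif res in predicted[protein]:
                row.append(res)         # residue is in the predicted site
            else:
                row.append(f"({res})")  # residue present but not predicted
        table[protein] = row
    return table


# Invented example data: two proteins, three aligned positions.
aligned = {
    "protA": ["D20", "K42", "H88"],
    "protB": ["D31", "R55", None],
}
predicted = {
    "protA": {"D20", "K42", "H88"},
    "protB": {"D31"},
}
table = build_salsa_table(aligned, predicted)
for protein, row in table.items():
    print(protein, *row, sep="\t")
```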
Consensus signatures for a given functional subclass are established using POOL predictions on a set of previously characterized proteins with the same biochemical function, usually with common fold. To maximize sequence diversity in this reference set, sets of structures are sought with the lowest possible sequence identity among them. POOL-predicted residues of the same amino acid type in the same spatial position for the majority of the previously characterized proteins of common biochemical function then constitute the consensus signature for that functional group. The consensus signature for a given biochemical function thus consists of a series of amino acid types in specified spatial positions.
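The majority rule described above can be sketched in Python as follows. Here "majority" is taken to mean strictly more than half of the reference proteins, which is an assumption; the function name and the example rows are hypothetical.

```python
from collections import Counter


def consensus_signature(rows):
    """Return the majority amino acid type per aligned position.

    rows: one list per reference protein, holding the one-letter type of the
    predicted residue at each aligned position, or None if none was predicted.
    A position with no strict majority gets None in the signature.
    """
    n_proteins = len(rows)
    signature = []
    for column in zip(*rows):
        counts = Counter(aa for aa in column if aa is not None)
        if counts:
            aa, count = counts.most_common(1)[0]
            signature.append(aa if count > n_proteins / 2 else None)
        else:
            signature.append(None)
    return signature


# Invented reference set: five proteins, four aligned positions.
reference_rows = [
    ["D", "K", "D", "H"],
    ["D", "K", "D", None],
    ["D", "K", "E", "S"],
    ["D", "R", "D", None],
    ["D", "K", "D", "T"],
]
signature = consensus_signature(reference_rows)
print(signature)
```

In this toy set the last position has no residue type shared by a majority of proteins, so it drops out of the signature.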
SG proteins of unknown or uncertain function are analyzed by POOL and the predictions are aligned with those of proteins of known function, or with the consensus signature.
Scoring the match between the predicted active site for the query protein and that of the consensus signature is performed using the BLOSUM62 matrix. Scores are reported as a percentage of the maximum value (i.e. the score for the perfect match, the consensus signature with itself).
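A minimal Python sketch of this normalization is given below. Only the handful of BLOSUM62 entries needed for the toy example are embedded. The treatment of positions where the query has no predicted residue is not specified here, so the `missing_score` parameter and its default of 0 are assumptions, as are the function names and the toy signature.

```python
# Subset of the (symmetric) BLOSUM62 matrix, sufficient for the toy example.
BLOSUM62_SUBSET = {
    ("D", "D"): 6, ("E", "E"): 5, ("K", "K"): 5, ("R", "R"): 5,
    ("D", "E"): 2, ("K", "R"): 2,
}


def pair_score(a, b):
    """Look up a substitution score, trying both orderings of the pair."""
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    return BLOSUM62_SUBSET[(b, a)]  # KeyError if the pair is not in the subset


def match_percent(query, signature, missing_score=0):
    """Score `query` against `signature` as a percentage of the self-match score.

    query:     one-letter residue types per aligned position (None if missing)
    signature: consensus residue types per aligned position
    missing_score: assumed contribution of a position with no query residue
    """
    if len(query) != len(signature):
        raise ValueError("query and signature must have the same length")
    raw = sum(
        missing_score if q is None else pair_score(q, s)
        for q, s in zip(query, signature)
    )
    best = sum(pair_score(s, s) for s in signature)
    return 100.0 * raw / best


toy_signature = ["D", "K", "D", "K"]            # hypothetical consensus
perfect = match_percent(toy_signature, toy_signature)
partial = match_percent(["D", "K", "E", "K"], toy_signature)
print(round(perfect, 1), round(partial, 1))
```

In practice the full matrix would be loaded from a standard source rather than typed in by hand.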
Orotidine 5'-monophosphate decarboxylase (OMPDC) catalyzes one step in the pyrimidine biosynthesis pathway. It catalyzes the metal ion dependent decarboxylation of orotidine monophosphate (OMP) to uridine monophosphate (UMP) and CO2 [31, 32]. OMPDC is a member of the ribulose phosphate binding barrel (RPBB) superfamily and has a TIM barrel structure, with the active site located inside the beta barrel, spanning the eight beta strands. The structural genomics protein PY01515 (PDB ID 2aqw) is a putative OMPDC from Plasmodium yoelii.
Sequence identity matrix for five previously characterized OMPDCs (structures 1-5) and the SG protein PY01515 (PDB ID 2aqw).
Local structural alignment of the consensus signature residues for the OMPDCs.
Structurally aligned signature active site residues for OMPDC
The quality of a match with the consensus signature may be measured using a scoring matrix. Using the BLOSUM62 matrix, the first four proteins listed in Table 2 have a score of 48 with the consensus signature; this score is 100% of the maximum value. The Plasmodium falciparum structure has a score of 39 (81% of the maximum value) against the consensus signature.
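As a quick check of the reported percentages, write S(a, b) for the summed BLOSUM62 score between two aligned residue lists, q for a query site, and c for the consensus signature. This notation is introduced here for illustration only.

```latex
\[
\text{match}(q) = 100 \times \frac{S(q, c)}{S(c, c)}, \qquad
100 \times \frac{48}{48} = 100\%, \qquad
100 \times \frac{39}{48} = 81.25\% \approx 81\%
\]
```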
The structurally aligned residues for the SG protein PY01515 from Plasmodium yoelii are shown in the last row of Table 2. For seven out of the eight positions, POOL predicts residues that are identical to the consensus signature residues of the previously characterized OMPDCs. The only variation is in position 6, where there is an asparagine that is not predicted by POOL, just as in the Plasmodium falciparum OMPDC. PY01515 has a score of 39 (81% of the maximum value) against the consensus signature, using the BLOSUM62 scoring matrix. The strong match between the predicted active site for PY01515 and those of the previously characterized OMPDCs indicates that the putative OMPDC functional assignment is correct.
YP_001304206.1 (PDB ID 3h3l) is a putative sugar hydrolase from Parabacteroides distasonis, a commensal bacterium of the human intestinal tract. YP_001304206.1 is a member of the Concanavalin A-like lectins/glucanases superfamily.
Local structural alignment of the residues in the GH16 consensus signature positions for the known representative GH16, endo-1,3-1,4-beta-D-glucan 4-glucanohydrolase, with the SG protein YP_001304206.1.
Local structural alignment of the predicted active site residues by SALSA for a known ECH from Rattus norvegicus (PDB ID 1ey3) with predicted residues for a Structural Genomics protein from Mycobacterium avium (PDB ID 3q1t), reported to be an ECH.
Known ECH (1ey3)
SG protein "ECH" (3q1t)
Local structural alignment of the predicted residues for the SG protein from Mycobacterium avium (PDB ID 3q1t) with the corresponding residues of ABDH from Cyanobacterium anabaena.
SG protein "ECH" (3q1t)
Known ABDH (2j5s)
Local structural matching, as implemented by the SALSA method, provides a more compelling prediction of biochemical function than a simple, global 3D structure match. SALSA can confirm putative annotations, identify misannotations, and, in some cases of misannotation, suggest a more probable functional annotation.
For any given protein structure of previously characterized function, the list of residues reported in the literature to be important for the biochemical function is a subset of the list of residues predicted by POOL. This longer list is a key advantage of the present method, as it enables better discrimination between the functional subclasses.
To date, one prediction made by local site matching using our electrostatics-based functional site prediction has been verified experimentally by direct biochemical assays. Further experimental testing of SALSA function predictions is in progress.
The BLOSUM62 scoring matrix has been used to measure the quality of the match between two predicted active sites. Whether there exists a better scoring matrix for this purpose is currently under investigation. At the present time, there are too few SG proteins with experimentally verified biochemical function to be able to translate the match score into a confidence metric, but as experimental testing progresses, this will become possible.
The SALSA method is amenable to automation and could be used to complement sequence-based function annotation methods, such as those evaluated in the CAFA experiments.
ZW, PY, JSL, and RP are doctoral candidates in the Department of Chemistry and Chemical Biology at Northeastern University. SS earned the Ph.D. degree in Chemistry from Northeastern University in 2011 and is currently engaged in postdoctoral research at Yale University. MJO is Professor of Chemistry and Chemical Biology and is Principal Investigator of the Computational Biology Research Group at Northeastern University.
ABDH: Anabaena beta-diketone hydrolase
BLOSUM: BLOcks of amino acid SUbstitution Matrix
CAFA: Critical Assessment of Function Annotation
GH16: glycoside hydrolase family 16
INTREPID: INformation-theoretic TREe traversal for Protein functional site Identification
OMPDC: orotidine 5'-monophosphate decarboxylase
PDB: Protein Data Bank
POOL: Partial Order Optimum Likelihood
RPBB: Ribulose Phosphate Binding Barrel
SALSA: Structurally Aligned Local Sites of Activity
THEMATICS: THEoretical Microscopic Anomalous TItration Curve Shapes
The support of the NSF under grants number MCB-0843603 and MCB-1158176 is gratefully acknowledged. JSL is an NSF Graduate Research Fellow.
This article has been published as part of BMC Bioinformatics Volume 14 Supplement 3, 2013: Proceedings of Automated Function Prediction SIG 2011 featuring the CAFA Challenge: Critical Assessment of Function Annotations. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/14/S3.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.