Variation in structural location and amino acid conservation of functional sites in protein domain families
© Pils et al; licensee BioMed Central Ltd. 2005
Received: 24 May 2005
Accepted: 25 August 2005
Published: 25 August 2005
The functional sites of a protein present important information for determining its cellular function and are fundamental in drug design. Accordingly, accurate methods for the prediction of functional sites are of immense value. Most available methods are based on a set of homologous sequences and structural or evolutionary information, and assume that functional sites are more conserved than the average. In the analysis presented here, we have investigated the conservation of location and type of amino acids at functional sites, and compared the behaviour of functional sites between different protein domains.
Functional sites were extracted from experimentally determined structural complexes from the Protein Data Bank harbouring a conserved protein domain from the SMART database. In general, functional (i.e. interacting) sites whose location is more highly conserved are also more conserved in their type of amino acid. However, even highly conserved functional sites can present a wide spectrum of amino acids. The degree of conservation strongly depends on the function of the protein domain and ranges from highly conserved in location and amino acid to very variable. Differentiation by binding partner shows that ion binding sites tend to be more conserved than functional sites binding peptides or nucleotides.
The results gained by this analysis will help improve the accuracy of functional site prediction and facilitate the characterization of unknown protein sequences.
Protein function is determined by the spatial configuration and type of amino acids at functional sites. Knowledge of functional sites provides valuable information for the assignment of molecular function, potential physiological binding partners and hence drug design. Tasks performed by functional sites range from the binding of small molecules like ions, cofactors, metabolic substrates or high molecular weight compounds such as nucleic acids and peptide chains, to catalysing chemical reactions in the active centre of enzymes.
The exponentially growing number of uncharacterised protein sequences in the public databases has turned the development of automatic identification of functional sites into an important research field and many computational methods focusing on this area have been described in recent years (for review see [1–3]). In contrast to structural approaches that search for ligand binding pockets on the protein surface using molecular modelling [4, 5], network analysis , or compare the protein surface to structures with known interacting sites [7, 8], many methods are based on a set of homologous sequences combined with evolutionary or structural information. The evolutionary trace (ET) method [9, 10], for example, searches for a structural cluster of conserved residues. Beginning with a sequence identity tree from a set of homologous proteins, the tree is scanned for subgroup-specific residues, which are invariant within the subgroup but vary between subgroups. These residues, called evolutionary trace residues, and the residues that are invariant in all sequences are then mapped onto a representative 3D structure and clusters of high ranking residues, corresponding to the inner nodes of the tree, are searched. These clusters usually coincide with the functional center of the protein. Improvements of the ET method use sequence weights based on their similarity (weighted evolutionary tracing) and an amino acid substitution matrix to account for biochemically similar amino acids in the identification of the trace residues , they consider the evolutionary distance between proteins due to the phylogenetically biased databases  or allow different rates of amino acid substitutions at protein sites . A similar approach is focusing more on structural information and calculates a conservation score at each position under consideration of the behaviour of spatial neighbours .
Most of the above mentioned methods assume that functional sites are under high selective pressure and conserved within the protein, so that functional sites can easily be detected by lower rates of amino acid substitutions. However, functional sites can vary in subfamilies and homologous protein sequences can perform different functions using a different set of functional residues. Accordingly, interaction interfaces can vary in their location in distant homologues and this has to be considered if interaction interfaces are inferred from homologous proteins [15, 16].
Prior to their prediction, it is necessary to understand the arrangement and properties of functional sites, as well as how protein families and single sequences differ in their use. Effort towards this direction has been made by several groups, who studied physicochemical properties of protein-protein interaction interfaces found in homodimer, heterodimer or intra-chain domain complexes [17–21], as well as protein-DNA interaction interfaces .
Here, we perform a large-scale analysis of functional sites extracted from experimentally determined protein-ligand complexes stored in the protein data bank (PDB) , grouped by the presence of conserved protein domains described in the SMART database . The classification into protein families enables us to find differences in amino acid conservation and use of specific locations for functional sites between the families. The analysis shows that domains vary strongly in the conservation of interacting sites and provides useful information for the prediction of interacting sites based on homologous protein sequences.
Results and discussion
Our analysis of interacting sites is based on conserved protein domains found in the structures of the protein data bank (PDB). Family sequence alignments of protein domains were retrieved from the SMART database and used to scan the protein sequences of the PDB with domain-specific Hidden Markov Models (HMMs). Wherever a domain was identified, all interactions between an amino acid belonging to this domain and any ligand were extracted and the position of the interacting amino acid was transferred onto the HMM consensus.
Table 1 lists the number of sequences with ligand interactions extracted from PDB and the number of interacting sites found in these sequences [see Additional file 1]. The HMM consensus was used as a reference sequence to be able to compare the location of interactions among different sequences of a domain family. The positions in the consensus sequence correspond to positions shared by most sequences of the domain family (match states) and most domains interact only in regions for which an HMM match state exists. In contrast, if the interacting position corresponds to an insert state in the alignment to the HMM consensus, the information of interaction could not be transferred to the domain consensus. There are several domains, which have a comparatively large number of interacting sites located in loop regions. These amino acid sites are only present in subfamilies of the domain and are lacking an HMM match state, so that mapping the information of interaction onto the HMM was not possible. An overview of the number of interactions at HMM insert states for all investigated protein domains is given in table 1 [see Additional file 1]. In domains of the immunological system, e.g. the immunoglobulin (IG) and immunoglobulin V type (IGv) domains, up to 50% of the interacting sites are located in these regions. Strikingly, other extra-cellular domains (fibronectin type 3 (FN3), C-type lectin (CLECT), leucine rich repeat C-terminal domain (LRRCT)) also present a large number of interacting sites in loop regions. These domains all have a common involvement in highly specific recognition processes, where any restrictions in the choice of amino acid or structural constraints would be disadvantageous for the function of the domain. The loop regions without any match states exactly fulfil this condition and therefore biochemical properties of the domain can be fine-tuned to complement the appearance of the ligand. The increased use of variable loop regions for functional sites seems to be a common characteristic of extracellular highly ligand-specific domains.
Conservation of location of interacting sites
In order to compare the use of specific interacting positions within a domain family we introduced a score (ConsInt) to measure the conservation of interaction at a given site in the family alignment. The score reflects the importance of an amino acid site in domain interactions and ranges between 0 and 1. Scores greater than 0 mean that at least two non-identical sequences interact at this position, and a score of 1 arises if all sequences interact at this position. For the comparison of the interaction scores and amino acid conservation, we only calculated the score for sites, where the interaction is achieved by atoms of the amino acid side chain. These scores are generally smaller than those calculated for side chain and backbone interactions, because side chain interactions are only part of the total interactions and many sites are very flexible in the contribution of atoms to the interaction interface. Corresponding positions in homologous sequences can interact with side chain atoms in one sequence and exclusively with backbone atoms in the next sequence.
Amino acid conservation of interacting sites
Correlation of interaction and amino acid conservation
Our analysis reveals that functional sites can be highly variable in their amino acid conservation and very flexible in using various locations in the protein domain. The properties of functional sites are dependent on the protein family and can vary from highly conserved, as observed in enzymes involved in DNA replication, to protein families that are highly variable with various amino acids at various locations, as for example immunoglobulins or carbohydrate-binding domains. Similar results were obtained by other groups. Pachenko and co-workers analyzed 86 domains from the CDD database and report that functional sites of homologous sequences can greatly differ in their physicochemical properties and their location in the three-dimensional structure . Variability in functional sites was also described by Devos and Valencia. By comparing the conservation of binding sites in structural alignments, they found high conservation in diverged sequences contrasting highly similar sequences with different interacting sites . Our findings present valuable information for the improvement of methods to predict functional sites. In most of these methods, the prediction is based on a set of homologous sequences. This approach results only in reliable predictions if the investigated protein family is conserved in most of the functional sites. Approaches to improve the accuracy of prediction have been made recently by using orthologous proteins with presumably the same ligand specificity  or by sub-typing protein families . With the growing number of sequences within protein families, trends that consider variability of functional sites and use subgrouping aided by experimental information might become widely accepted in the future and promise to be successful in the prediction of functional sites.
The analysis is based on the October version of PDB (27969 structures). All protein sequences extracted from PDB files were scanned against all SMART domains (667) using hmmsearch from the HMMER package (version 2.3.2) . Profile HMMs were retrieved from the SMART family alignments and score thresholds were used as assigned by SMART for each individual domain . The search resulted in 8747 protein structures containing at least one of 480 SMART domains. For those structures containing a protein-ligand interaction, we calculated the distance for each atom of all protein compounds to each atom of all ligands. We considered amino acids as interacting if the distance of any atom of the amino acid to any atom of the ligand was smaller than 4 Angstrom, which is a very conservative threshold and is in the range of the two oxygen atoms in a hydrogen bond. Interactions to water molecules were neglected, as well as interactions between identical chains, because homodimers can present artefacts arising from the crystallization process. We are aware of losing information about naturally occurring homodimers. If more than one model of the protein structure exists, for example in the case of NMR data, all models were taken into account and a position was treated as interacting if more than 50% of all models harbour an interaction at this position. For each domain family, a multiple sequence alignment was generated from the sequences identified in the PDB scan for SMART domains. The alignments were created with hmmalign (HMMER package), according to the SMART profile HMMs. Distance trees were created with PROTDIST and FITCH, both from the Phylip package .
A problem with using the protein data bank is the overrepresentation of some proteins, while others are completely absent. To deal with the biased nature of the database, we used sequence weights, correlated to the evolutionary distance between the sequences. The distance is small for similar sequences, so that these proteins are weighted with a negligible small factor. Although many proteins are redundant in our dataset it is advantageous to consider all proteins because they can be bound to different ligands or the complex crystallized under different conditions. The most important interacting sites should clearly stand out in the large scale analysis.
Ligands interacting with protein domains were, wherever possible, classified into groups of peptides, nucleotides or ions. The group of ions was restricted to small ions and excluded typical buffer anions. The groups of peptides and nucleotides also included modified molecules that are functionally alike. We also analysed all ligands together, including carbohydrates, buffer ions and other small molecules.
Calculation of scores
For each alignment position consistent with the HMM consensus sequence, we calculated a position-specific interaction conservation score (ConsInt) to describe how well a position is conserved in ligand interactions within the domain family.
where N is the number of sequences in the alignment, Dist(seqa, seqb) is the phylogenetic distance between sequence a and b obtained from the phylogenetic tree and Inti takes a value of 1 if sequence a and b both interact at position i, and 0 otherwise. The score ranges from 0 to 1 and is 1 if all sequences interact at the position of interest. To obtain a score greater than 0 for a certain position at least two non-identical sequences have to interact at this position. In this way, interactions to non-physiological ligands like artificially synthesized peptides should be lost, and important interacting positions should have noticeably higher scores. The score takes into account the phylogenetic distances between the sequences so that highly similar sequences are weighted more weakly, in contrast to interacting sites in divergent sequences, which are weighted more strongly.
Similar to the interaction score, a position specific amino acid conservation score (ConsAA) was calculated.
Here, Substi(a, b) measures the similarity between the amino acids at position i in sequences a and b. It is based on the VTML 160 substitution matrix . The conservation score was normalized between 0 (low aa conservation) and 1 (high aa conservation) ranging from the lowest to the highest score of the amino acid substitution matrix. It is important to note that our amino acid conservation scores do not describe protein families as found in the SMART database, but only the data set used in our analysis, so that the amino acid conservation score can be compared with the interaction conservation score for each position in the family alignment.
The software suite R was used for the statistical analyses . The concentration of data points on vertical lines found for higher amino acid conservation scores in the scatter plots of figure 6 are an effect of the discrete values of the amino acid substitution matrix and the few substitutions at conserved sites.
We would like to thank Ivica Letunic for scanning PDB for SMART domains and Wolfgang Huber for his assistance on the statistical analysis. This work was supported by the DFG grant BO-1099/5-2 (BP).
- Lichtarge O, Sowa ME: Evolutionary predictions of binding surfaces and interactions. Curr Opin Struct Biol 2002, 12: 21–27. 10.1016/S0959-440X(02)00284-1View ArticlePubMedGoogle Scholar
- Campbell SJ, Gold ND, Jackson RM, Westhead DR: Ligand binding: functional site location, similarity and docking. Curr Opin Struct Biol 2003, 13: 389–395. 10.1016/S0959-440X(03)00075-7View ArticlePubMedGoogle Scholar
- Jones S, Thornton JM: Searching for functional sites in protein structures. Curr Opin Chem Biol 2004, 8: 3–7. 10.1016/j.cbpa.2003.11.001View ArticlePubMedGoogle Scholar
- Bhinge A, Chakrabarti P, Uthanumallian K, Bajaj K, Chakraborty K, Varadarajan R: Accurate detection of protein:ligand binding sites using molecular dynamics simulations. Structure (Camb) 2004, 12: 1989–1999. 10.1016/j.str.2004.09.005View ArticleGoogle Scholar
- Laurie AT, Jackson RM: Q-SiteFinder: an energy-based method for the prediction of protein-ligand binding sites. Bioinformatics 2005, 21: 1908–1916. 10.1093/bioinformatics/bti315View ArticlePubMedGoogle Scholar
- Amitai G, Shemesh A, Sitbon E, Shklar M, Netanely D, Venger I, Pietrokovski S: Network analysis of protein structures identifies functional residues. J Mol Biol 2004, 344: 1135–1146. 10.1016/j.jmb.2004.10.055View ArticlePubMedGoogle Scholar
- Stark A, Shkumatov A, Russell RB: Finding functional sites in structural genomics proteins. Structure (Camb) 2004, 12: 1405–1412. 10.1016/j.str.2004.05.012View ArticleGoogle Scholar
- Kinoshita K, Nakamura H: Identification of the ligand binding sites on the molecular surface of proteins. Protein Sci 2005, 14: 711–718. 10.1110/ps.041080105PubMed CentralView ArticlePubMedGoogle Scholar
- Lichtarge O, Bourne HR, Cohen FE: An evolutionary trace method defines binding surfaces common to protein families. J Mol Biol 1996, 257: 342–358. 10.1006/jmbi.1996.0167View ArticlePubMedGoogle Scholar
- Madabushi S, Yao H, Marsh M, Kristensen DM, Philippi A, Sowa ME, Lichtarge O: Structural clusters of evolutionary trace residues are statistically significant and common in proteins. J Mol Biol 2002, 316: 139–154. 10.1006/jmbi.2001.5327View ArticlePubMedGoogle Scholar
- Landgraf R, Fischer D, Eisenberg D: Analysis of heregulin symmetry by weighted evolutionary tracing. Protein Eng 1999, 12: 943–951. 10.1093/protein/12.11.943View ArticlePubMedGoogle Scholar
- Armon A, Graur D, Ben-Tal N: ConSurf: an algorithmic tool for the identification of functional regions in proteins by surface mapping of phylogenetic information. J Mol Biol 2001, 307: 447–463. 10.1006/jmbi.2000.4474View ArticlePubMedGoogle Scholar
- Pupko T, Bell RE, Mayrose I, Glaser F, Ben-Tal N: Rate4Site: an algorithmic tool for the identification of functional regions in proteins by surface mapping of evolutionary determinants within their homologues. Bioinformatics 2002, 18 Suppl 1: S71–7.View ArticlePubMedGoogle Scholar
- Landgraf R, Xenarios I, Eisenberg D: Three-dimensional cluster analysis identifies interfaces and functional residue clusters in proteins. J Mol Biol 2001, 307: 1487–1502. 10.1006/jmbi.2001.4540View ArticlePubMedGoogle Scholar
- Aloy P, Ceulemans H, Stark A, Russell RB: The relationship between sequence and interaction divergence in proteins. J Mol Biol 2003, 332: 989–998. 10.1016/j.jmb.2003.07.006View ArticlePubMedGoogle Scholar
- Rekha N, Machado SM, Narayanan C, Krupa A, Srinivasan N: Interaction interfaces of protein domains are not topologically equivalent across families within superfamilies: Implications for metabolic and signaling pathways. Proteins 2005, 58: 339–353. 10.1002/prot.20319View ArticlePubMedGoogle Scholar
- Jones S, Thornton JM: Analysis of protein-protein interaction sites using surface patches. J Mol Biol 1997, 272: 121–132. 10.1006/jmbi.1997.1234View ArticlePubMedGoogle Scholar
- Lijnzaad P, Argos P: Hydrophobic patches on protein subunit interfaces: characteristics and prediction. Proteins 1997, 28: 333–343. 10.1002/(SICI)1097-0134(199707)28:3<333::AID-PROT4>3.0.CO;2-DView ArticlePubMedGoogle Scholar
- Larsen TA, Olson AJ, Goodsell DS: Morphology of protein-protein interfaces. Structure 1998, 6: 421–427. 10.1016/S0969-2126(98)00044-6View ArticlePubMedGoogle Scholar
- Lo Conte L, Chothia C, Janin J: The atomic structure of protein-protein recognition sites. J Mol Biol 1999, 285: 2177–2198. 10.1006/jmbi.1998.2439View ArticlePubMedGoogle Scholar
- Valdar WS, Thornton JM: Protein-protein interfaces: analysis of amino acid conservation in homodimers. Proteins 2001, 42: 108–124. 10.1002/1097-0134(20010101)42:1<108::AID-PROT110>3.0.CO;2-OView ArticlePubMedGoogle Scholar
- Luscombe NM, Thornton JM: Protein-DNA interactions: amino acid conservation and the effects of mutations on binding specificity. J Mol Biol 2002, 320: 991–1009. 10.1016/S0022-2836(02)00571-5View ArticlePubMedGoogle Scholar
- Deshpande N, Addess KJ, Bluhm WF, Merino-Ott JC, Townsend-Merino W, Zhang Q, Knezevich C, Xie L, Chen L, Feng Z, Green RK, Flippen-Anderson JL, Westbrook J, Berman HM, Bourne PE: The RCSB Protein Data Bank: a redesigned query system and relational database based on the mmCIF schema. Nucleic Acids Res 2005, 33 Database Issue: D233–7.Google Scholar
- Letunic I, Copley RR, Schmidt S, Ciccarelli FD, Doerks T, Schultz J, Ponting CP, Bork P: SMART 4.0: towards genomic data integration. Nucleic Acids Res 2004, 32: D142-D144. 10.1093/nar/gkh088PubMed CentralView ArticlePubMedGoogle Scholar
- Villafranca JE, Robertus JD: Ricin B chain is a product of gene duplication. J Biol Chem 1981, 256: 554–556.PubMedGoogle Scholar
- Rutenber E, Ready M, Robertus JD: Structure and evolution of ricin B chain. Nature 1987, 326: 624–626. 10.1038/326624a0View ArticlePubMedGoogle Scholar
- Werner MH, Huth JR, Gronenborn AM, Clore GM: Molecular basis of human 46X,Y sex reversal revealed from the three-dimensional solution structure of the human SRY-DNA complex. Cell 1995, 81: 705–714. 10.1016/0092-8674(95)90532-4View ArticlePubMedGoogle Scholar
- Ohndorf UM, Rould MA, He Q, Pabo CO, Lippard SJ: Basis for recognition of cisplatin-modified DNA by high-mobility-group proteins. Nature 1999, 399: 708–712. 10.1038/21460View ArticlePubMedGoogle Scholar
- Littler SJ, Hubbard SJ: Conservation of orientation and sequence in protein domain--domain interactions. J Mol Biol 2005, 345: 1265–1279. 10.1016/j.jmb.2004.11.011View ArticlePubMedGoogle Scholar
- Clackson T, Wells JA: A hot spot of binding energy in a hormone-receptor interface. Science 1995, 267: 383–386.View ArticlePubMedGoogle Scholar
- Keskin O, Ma B, Nussinov R: Hot regions in protein--protein interactions: the organization and contribution of structurally conserved hot spot residues. J Mol Biol 2005, 345: 1281–1294. 10.1016/j.jmb.2004.10.077View ArticlePubMedGoogle Scholar
- Panchenko AR, Kondrashov F, Bryant S: Prediction of functional sites by analysis of sequence and structure conservation. Protein Sci 2004, 13: 884–892. 10.1110/ps.03465504PubMed CentralView ArticlePubMedGoogle Scholar
- Devos D, Valencia A: Practical limits of function prediction. Proteins 2000, 41: 98–107. 10.1002/1097-0134(20001001)41:1<98::AID-PROT120>3.0.CO;2-SView ArticlePubMedGoogle Scholar
- Mirny LA, Gelfand MS: Using orthologous and paralogous proteins to identify specificity-determining residues in bacterial transcription factors. J Mol Biol 2002, 321: 7–20. 10.1016/S0022-2836(02)00587-9View ArticlePubMedGoogle Scholar
- Hannenhalli SS, Russell RB: Analysis and prediction of functional sub-types from protein sequence alignments. J Mol Biol 2000, 303: 61–76. 10.1006/jmbi.2000.4036View ArticlePubMedGoogle Scholar
- Eddy S: HMMER: sequence analysis using profile hidden Markov models. [http://hmmerwustledu]
- Schultz J, Milpetz F, Bork P, Ponting CP: SMART, a simple modular architecture research tool: identification of signaling domains. Proc Natl Acad Sci U S A 1998, 95: 5857–5864. 10.1073/pnas.95.11.5857PubMed CentralView ArticlePubMedGoogle Scholar
- Felsenstein J: PHYLIP - Phylogeny Inference Package (Version 3.2). Cladistics 1989, 5: 164–166.Google Scholar
- Muller T, Spang R, Vingron M: Estimating amino acid substitution models: a comparison of Dayhoff's estimator, the resolvent approach and a maximum likelihood method. Mol Biol Evol 2002, 19: 8–13.View ArticlePubMedGoogle Scholar
- R Development Core Team: R: A language and environment for statistical computing. Vienna, Austria, R Foundation for Statistical Computing; 2004.Google Scholar
- Pascal JM, Day PJ, Monzingo AF, Ernst SR, Robertus JD, Iglesias R, Perez Y, Ferreras JM, Citores L, Girbes T: 2.8-A crystal structure of a nontoxic type-II ribosome-inactivating protein, ebulin l. Proteins 2001, 43: 319–326. 10.1002/prot.1043View ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.