Sequence variation in ligand binding sites in proteins

Magliery, Thomas J; Regan, Lynne

doi:10.1186/1471-2105-6-240

Research article
Open access
Published: 30 September 2005

Sequence variation in ligand binding sites in proteins

Thomas J Magliery^1,2 &
Lynne Regan^1,3

BMC Bioinformatics volume 6, Article number: 240 (2005) Cite this article

8366 Accesses
65 Citations
3 Altmetric
Metrics details

Abstract

Background

The recent explosion in the availability of complete genome sequences has led to the cataloging of tens of thousands of new proteins and putative proteins. Many of these proteins can be structurally or functionally categorized from sequence conservation alone. In contrast, little attention has been given to the meaning of poorly-conserved sites in families of proteins, which are typically assumed to be of little structural or functional importance.

Results

Recently, using statistical free energy analysis of tetratricopeptide repeat (TPR) domains, we observed that positions in contact with peptide ligands are more variable than surface positions in general. Here we show that statistical analysis of TPRs, ankyrin repeats, Cys₂His₂ zinc fingers and PDZ domains accurately identifies specificity-determining positions by their sequence variation. Sequence variation is measured as deviation from a neutral reference state, and we present probabilistic and information theory formalisms that improve upon recently suggested methods such as statistical free energies and sequence entropies.

Conclusion

Sequence variation has been used to identify functionally-important residues in four selected protein families. With TPRs and ankyrin repeats, protein families that bind highly diverse ligands, the effect is so pronounced that sequence "hypervariation" alone can be used to predict ligand binding sites.

Background

The central challenge of the post-genomic era is to determine the structures and functions of thousands of newly-identified putative proteins [1, 2]. Elucidating how proteins carry out their functions in diverse contexts and in organisms from all three domains of life is both fundamentally important to understanding biological function and critical for engineering new functions into novel proteins. Sequence conservation alone can be used to structurally categorize many proteins or putative proteins [3]. Additionally, catalytic sites in enzymes can sometimes be identified from conserved surface motifs [4–8]. In contrast, sites with poor sequence conservation have been largely ignored, because they are assumed to be of little structural or functional importance [9].

Sequence alignment of proteins with similar structures has shown that as sequence identity increases, average backbone variation decreases [10]. Within a particular protein family, residues in the hydrophobic core are the most conserved, implying they play a key role in specifying the fold [11]. In contrast, solvent-exposed residues tend to be variable, with mutations having little deleterious effect on overall structure or stability [12]. Consequently, conservation of surface residues is commonly taken to be an indication of functional importance [13]. This idea can be used to identify active-site residues from a collection of proteins that perform the same function, but it is not applicable to families of proteins that use a common scaffold to bind diverse ligands. Rather, we might hypothesize that such binding sites will be composed of positions that are variable.

Recently, we used a statistical free energy (SFE) approach [14, 15] to understand better the role of conserved residues in defining tetratricopeptide repeat (TPR) motifs, which are thought to commonly mediate protein-protein interactions [16]. Strikingly, we found that the ligand-binding site of the motif can readily be identified by sequence hypervariation of positions proximal to the ligand, as evidenced by very low statistical free energies separating those positions from a position-independent reference state. Here, we examine this observation in more detail and demonstrate that specificity-determining residues in TPRs, ankyrin (Ank) repeats, Cys₂His₂ zinc fingers (Zifs), and PDZ domains can be identified from sequence variation.

By analyzing protein families with exceptional biochemical and biophysical characterization, we show that, when the ligand repertoire is highly diverse for a particular family, the binding site can be identified from sequence hypervariation alone. However, even when the ligands have significant features in common, sequence variation can be used to "dissect" binding sites to identify specificity-determining residues. We demonstrate this sequence variation using probabilistic and information theory approaches closely tied to the mathematics of covariation, which are more suitable for this type of analysis than SFEs or Shannon entropies. Statistical identification of specificity-determining residues will greatly facilitate the engineering proteins with novel functions and targets.

Results & discussion

The TPR binding site

The TPR is a common 34 amino-acid protein motif that occurs in arrays, most frequently with three contiguous repeats [17]. Although TPR domains are thought to mediate protein-protein interactions, only a few examples have been well characterized. The large number of known TPR sequences, nearly 10,000 in Pfam [18], makes this motif an excellent target for statistical analysis. Using several mathematical approaches, we calculated the separation of the amino acid distribution at each position in TPRs from a position-independent reference state, amino acid usage in all proteins in yeast (Figure 1a). Note that we have performed this calculation on all of the 34 positions in the TPR motif; TPR domains are made up of tandem repeats of the TPR motif.

When relative entropy values are mapped onto the ligand-bound co-crystal structures of two different three-TPR domains (the TPR1 and TPR2A domains from Hsp-Organizing Protein, HOP [19]), it is immediately apparent that the concave, peptide-binding face of the TPR domains is more variable (i.e., more like the reference state) than the convex, solvent-exposed face (Figure 1b and 1c). For clarity, the concave surfaces in Figure 1d are depicted with coloration of only ligand-binding residues independently identified from crystallographic analysis [19]. For both TPR domains, every residue in contact with the ligand peptide is in a position with a small relative entropy (blue or green in the figures), indicating small differences from the reference distribution.

This hypervariation is a consequence of each TPR having evolved to bind a different specific ligand (or portion of a ligand). When TPR proteins are considered collectively, the binding positions are statistically randomized to an extent that is dictated by the repertoire of amino acids required to perform the range of binding functions. In contrast, solvent-exposed residues in general mutate in a stochastic fashion throughout evolution, slowly reverting or "drifting" toward a "neutral" distribution. The lack of structural or functional importance of solvent-exposed residues results in little selective pressure against mutation, but the extent of randomization is limited by evolutionary time and subtle factors such as overall protein solubility. As a result of the high diversity of the ligands of TPR repeats, the binding surface is more variable than the solvent-exposed surface. Specifically, we predict that the ligand-binding residues that show the most sequence variation are the positions that determine ligand specificity.

Positions 2, 5, 9, 12, 13, 33 and 34 show the most sequence variation in TPRs (i.e., they have the lowest relative entropies, ≤ 0.30). These seven residues all lie on the same face of the motif, and they are the residues that are exposed on the concave face of TPR domains. In fact, the TPR-peptide co-crystal structures show that residues in positions 2, 5, 6, 9, 12, and 13 are used by HOP TPR1 and TPR2A to bind their ligands [19]. Since few TPR-ligand structures have been solved, it remains to be seen whether or not other TPR domains utilize positions 33 and 34. Their spatial proximity to the other binding residues suggests that this is likely.

It is not surprising that some of the positions in contact with the ligand peptides are more biased than others (such as position 6). Some positions used for binding may also have other restrictions (such as structural restrictions) that limit the repertoire of amino acids allowed, and some positions may be important for binding affinity but not for specificity (which is to say they may bind a feature that is common among all ligands). For example, position 6 is modestly conserved in TPRs overall (it is frequently Asn) and is more buried than the other binding positions. The position-6 residues make contacts to the backbone of the ligand peptides here. It is worth noting that a position-6 Asn in the PEX5 C-terminal TPR domain also appears to make contact to the peptide backbone of the unrelated peroxisomal targeting signal peptide [20].

Several of the spheres in TPR2A are grey, indicating that they correspond to non-canonical positions and were therefore not calculated. At present, our analysis does not consider the effect of insertions and deletions. In the future, one could imagine including "deletion" as another "amino acid," so that site occupancy would contribute to the variation score.

Measuring sequence variation

The use of metrics that measure the difference from a position-independent reference distribution is key to our observation, because (1) it is not clear how mere lack of conservation is related to variability, and (2) the likelihood of mutation from one residue to another is affected by factors such as genetic code bias and the greater difficulty of accommodating bulky or reactive amino acids. Here, we take the reference state to be amino acid usage in all open reading frames in Saccharomyces cerevisiae [21], which is independent of position but accounts for genetic code bias and amino acid chemistry. Using SFE calculations, we have previously demonstrated that using this reference state gives virtually indistinguishable results from other position-independent reference states such as amino acid usage in all proteins in the Pfam database [16].

We originally noted that the ligand binding site of TPRs was identified by sequence variation using statistical free energies [16]. SFEs are essentially a measure of the difference between amino acid distributions, relating the "probability" of observing a particular distribution to thermodynamic importance based on the exponential relationship given by the Boltzmann law.

$\frac{P}{P_{(r e f)}} = e^{\frac{Δ G_{s t a t}}{k T^{*}}} [1]$

(See Methods for an explanation of kT*.) This approach was introduced by Lockless & Ranganathan; however, their formalism for SFEs does not explicitly calculate the probability of observing a particular positional distribution [14]. Instead it uses the root-mean-square of the binomial probabilities of observing each amino acid, over all twenty amino acids x. That is,

$Δ G_{s t a t} = k T^{*} \sqrt{\sum_{x} {(\ln \frac{P_{x}}{P_{x (r e f)}})}^{2}} [2]$

where P_xis given by

$P_{x} = \frac{N!}{n_{x}! (N - n_{x})!} f_{x}^{n_{x}} {(1 - f_{x})}^{N - n_{x}} [3]$

Here, N is the total number of sequences, n_xis the number of sequences with amino acid x at the given position, and f_xis the expected frequency of x from the reference state. This "vector" formalism for estimating the overall probability is empirically quite effective, but we speculated that a metric more tied to the mathematics of covariation would be more rigorous for our approach.

Since both of the following are true,

$\sum_{x} n_{x} = N and \sum_{x} f_{x} = 1$

the probability of observing a particular distribution is simply given by the multinomial probability,

$P_{m u l t} = \frac{N!}{\prod_{x} n_{x}!} \prod_{x} f_{x} [4]$

As expected, the ΔG_statvalues are very closely related (R² = 0.89) to the ln P_multvalues for the 34 positions in the TPR motif (Figure 2a). Note, however, that the values of ΔG_statand ln P_multare dependent upon the total number of sequences N (since it is much less likely to observe a particular amino acid 200 times out of 1000 than 20 out of 100, if you are expecting it only 5% of the time).

Recently, Dekker et al. suggested that SFEs are merely a measure of sequence (Shannon) entropy (H_i), which implicitly measures how a distribution varies from equal usage [22]. (This is because the maximum entropy arises from a distribution with equal usage.)

$H_{i} = - \sum_{x} p_{x} \ln p_{x} [5]$

Here, p_xis the proportion of sequences with amino acid x at position i. While it is true that ΔG_statis correlated with H_i(Figure 2b), this correlation is a consequence of the reference state of ΔG_statcalculations being fairly close to equal usage of amino acids. It is also affected by the fact that ΔG_statvalues are not based on a rigorous measure of overall probability.

Plotting the multinomial probabilities associated with the distributions at the 34 positions in TPRs against sequence entropy and relative entropy clearly demonstrates that ln P_multmeasures the same thing as relative entropy, but something different from sequence entropy (Figure 2c). (The difference becomes more dramatic the more the reference state deviates from equal usage.) In fact, relative entropy D(p||f) is an information theory approach to measuring the "distance" between distributions, given by,

$D (p | | f) = \sum_{x} p_{x} \ln \frac{p_{x}}{f_{x}} [6]$

It can be shown, using the Stirling approximation for factorials, that multinomial probability is in fact linearly related to relative entropy by the number of sequences (see Supplemental Material).

One significant advantage of relative entropy over multinomial probability and ΔG_statis that relative entropy is independent of the total number of sequences N. Since the intention of the SFE approach is to estimate the significance associated with an amino acid distribution relative to a reference state, we submit that relative entropies are the most convenient way to do this. Relative entropies combine the sample-size independence and ease of calculation of Shannon entropies with the reference-state correction of Lockess & Ranganathan's method, while at the same time measuring that correction in a mathematically-rigorous way.

Other methods, in addition to Shannon entropy and the Lockless & Ranaganthan method, have been suggested for scoring residue conservation, including metrics that account for residue properties such as size or hydrophobicity [23]. There is evidence that binding sites have unique compositional preferences, which may suggest alternative reference states for our method [24]. It will be interesting to examine how attention to property variation may improve our method of dissecting binding sites in structural families.

Effects of sample size

In order to determine how many sequences are required to identify binding residues in TPRs by this method, subsets of various sizes were randomly selected from the 6,887 TPR sequences. In Figure 3a, the average relative entropy values from subsets (5 each) with approximately 6887, 3444, 1722, 861, 430, 215 and 108 randomly-selected sequences are shown for all positions. The overall pattern is evident with as few as about 100 sequences, and there is virtually no difference between subsets with 400 or more sequences. Values from the five random subsets of each size are shown for the seven lowest relative entropy positions in the full data set (Figure 3b). Again, there is essentially no discernable difference down to as few as 400 sequences, and there is not significant variability until one examines fewer than 200 sequences. We therefore expect that this analysis is applicable to protein families with as few as about 200–400 sequences.

Ankyrin repeats: comparison with experiment

A corresponding analysis of ankyrin (Ank) repeats, another experimentally well-characterized protein-binding motif, clearly confirms our prediction that low relative entropies can be used to identify specificity-determining residues when the repertoire of ligand is highly variable. Anks are helix-turn-helix-loop motifs, which bind their ligands with residues in the loops and on the surface of the helical array proximal to the loops [25]. Figure 4 shows the relative entropies from over 15,000 Ank repeats mapped onto the co-crystal structure of mouse GA binding protein β1 with the GABPα ligand bound [26]. Again, our analysis dramatically reveals that the residues known to form the binding site are among the most variable; the positions most like the reference state (relative entropy ≤ 0.39) are 2, 3, 5, 13, 14, 17, 32, and 33.

Significantly, the functional importance of the residues we have identified has already been verified experimentally. Plückthun and colleagues considered four Ank-ligand co-crystal structures and chose key interacting residues in Ank repeats from those whose solvent accessibility was most changed upon ligand binding: positions 2, 3, 5, 13, 14, and 33. Ank repeat domains which bind to different ligands, for example maltose binding protein and two kinases, have been selected from a library of Ank repeat domains in which only these positions were randomized [27, 28]. This result confirms our proposal that these are the specificity-determining residues. In fact, the crystal structure of one of the selected ankyrin proteins that binds MBP directly demonstrates the role of these residues in binding. In addition, position 32 is close in space to these residues, and may well participate in binding for some Ank proteins. It is not clear why position 17, which lies in the turn between the motif's helices, is more variable than other non-binding surface-exposed positions.

Dissecting binding sites when ligands have conserved features

We also analyzed Cys₂His₂ zinc fingers (Zifs) in the same way. In contrast to TPR and Ank repeats, which are stabilized by burial of hydrophobic residues, Zifs are mostly stabilized by ligation of zinc(II) and binding to DNA. Not surprisingly, then, positions that ligate the zinc ion (Cys_-10, Cys_-7, His₊₇, His₊₁₁) and a subset of positions that contact the DNA (e.g., Tyr_-12, Lys_-5, Phe_-3, Arg₊₉) are highly biased (where -1 is the position immediately before the α-helix and the consensus residue is listed). Residues with low relative entropy (≤0.5) are essentially in two patches, near the end of the end of the helix buried in the major groove and on the solvent-exposed surface distal to the DNA. When one considers only the positions that are in contact with the ligand DNA, the residues with the lowest relative entropies (blue and green spheres in Figure 5) are positions -2, -1, 1, 2, 3, 5 and 6. Extensive phage-display selection work has shown that positions -1, 1, 2, 3, 5 and 6 are critical to specificity for the target DNA sequences studied [29, 30]. In contrast, positions that contact the DNA but have no effect on specificity, such as basic residues that make contacts to the phosphate backbone, are essentially invariant (orange and red spheres in Figure 5).

Note that in this case the specificity-determining residues are not necessarily the most variable residues in the motif; they are the most variable residues in the motif that are in contact with the ligand. Thus, for Zifs, sequence hypervariation is not sufficient to identify the binding site, but statistical analysis together with a sample structure reveals specificity-determining positions without further characterization. Apparently, the repertoire of amino acids needed to bind DNA in the major groove is less diverse than that needed for the range of binding functions exhibited by TPR or Ank repeat domains.

A similar phenomenon is observed for PDZ domains, whose peptide ligands have highly conserved elements. PDZ domains are ubiquitous globular protein-protein interaction domains. It is thought that most PDZ domains bind to the C-terminus of target proteins, typically making contact to the carboxyl-terminal four to five residues. PDZ domains can be categorized into two classes (I and II) that bind to consensus sequences X-(S/T)-X-(V/I/L)-CO₂^- and X-Φ-X-Φ-CO₂^-, respectively (where Φ represents a hydrophobic residue and X represents an arbitrary residue) [31]. Since our method identifies residues that vary with the repertoire of ligands (i.e., specificity-determining residues), we would expect that positions that bind to the terminal carboxylate and C-terminal hydrophobic side-chain (P₀) will be highly biased (i.e., conserved); positions that bind to the alcohol or hydrophobic residue at P_-2 will be biased; and positions that bind to P_-1 and P_-3 will be highly variable. In Figure 6b, we have highlighted the positions in the example PDZ domain 3 from PSD95 that have been identified from NMR studies and X-ray co-crystal structures to be involved in binding to the terminal four residues of the ligand (323–328, 339, 340, 342, 372, 373, 376, 379 and 380) [32]. We also included positions 318, 322, 329 and 331, which are within 5 Å of the ligand peptide (KQTSV-CO₂^-) in the example structure (computed with DeepView [33]).

The eight most variable residues in this group are 331, 342, 376, 373, 339, 328, 340 and 326. The most variable residue, 331 (here, a glutamate), contacts the P_-4 lysine, which is variable among PDZ ligands and whose effect on specificity has not been extensively examined. Positions 326, 328, 339, 340 and 342 interact with the variable P_-1 and P_-3 positions. Interestingly, computational redesign of PDZ domain specificity confirms the central importance of these residues in specificity determination. Reina et al. were able to change the specificity of PSD95-3 from KQTSV to KITWV and KRTEV (retaining ligand class, since P₀ and P_-2 are the same) [34]. In the first case, residues 326, 339, 340, 342, 380 and 397 were mutated. In the second case, residues 326, 328, 339, 340, 342 and 397 were mutated. Note that position 397 is outside of the canonical PDZ domain, and so was not examined in our work; position 380 was mutated to improve the stability of the domain, not its binding specificity.

Positions 373 and 376 contact the P_-2 position, which is either an alcohol or a hydrophobic residue depending upon class. The identity of position 372 is known to be highly correlated with the ligand class, because it is typically occupied by a polar residue (often histidine) for class I ligands and by a hydrophobic residue for class II ligands [31]. As expected, position 372 displays intermediate variability (yellow in the figure). We hypothesize that positions 373 and 376 are much more variable than 372 and 380 because they are further away from the P_-2 threonine; in fact, position 373 is farther than 5 Å away. These residues are likely more important for binding class II ligands in which P_-2 is hydrophobic and therefore generally larger. P₀ is a hydrophobic residue in virtually every known PDZ ligand, and it invariantly presents the carboxylate terminus of the peptide. Not surprisingly, then, the residues that it contacts (322–325, 327) are highly biased. In fact, position 324 is a glycine in 97% of PDZ domains, and the turn in which it lies hosts the carboxylate.

It is worth noting that, in contrast to the other three examples presented above, this calculation was carried out on a whole domain instead of a repeat motif. It is also worth noting that the binding site, as in the case of zinc fingers, could not be easily predicted from relative variability of sequence alone due to commonalities among the ligands that result in conserved elements of the binding sites. However, in combination with an example co-crystal structure, the specificity-determining positions can again be inferred from sequence variation, and the inference matches closely what has been derived from extensive biochemical characterization and engineering.

The meaning of relative entropy values

In our previous study of TPRs by SFE analysis, we empirically demonstrated that for a particular sample size and scaling, levels of sequence variation (ΔG_stat) could be usefully grouped as such: 0–1.25, hypervariable or no bias; 1.25–2.5, slight bias; 2.5–5.0, significant bias; 5.0–10.0, dramatic bias; 10+, restriction to a small subset of amino acids [16]. Regression analysis between SFE and relative entropy values for all TPR positions suggests that these values correspond to relative entropies of approximately 0.3, 0.5, 0.9 and 1.5. For convenience, we therefore chose 0.0–0.3, 0.3–0.6, 0.6–0.9, 0.9–1.2 and 1.2+ as bins for coloration of the figures in this publication. The examples in this study suggest that "normal" surface positions typically exhibit relative entropies in the range of 0.3–0.6, and that specificity-determining positions typically have relative entropies less than 0.5. The overlap of these values highlights the difficulty in using this approach as a purely predictive algorithm: only when the repertoire of ligands is extremely diverse (as with TPRs and ankyrin repeats) is there a clear distinction between ligand-binding residues and surface residues in general. We are in the process of a much broader application of this procedure to all of the families in the Pfam database, which we will use to refine the meaning of the relative entropy values in protein families overall (M. Gerstein, T. Gianoulis, T.J.M. and L.R., unpublished work).

Conclusion

The notion that positions that bind diverse ligands will be variable among a family of proteins seems fairly obvious, but this approach has not yet been utilized as a general strategy. One notable precedent is seen in the original studies of antigen-binding sites in antibodies, which were identified as variable regions when the amino acid sequences of antibodies were first determined [35]. Various family-based approaches have been applied to the prediction of functional residues, typically analyzing sequence variability from collections of proteins with similar function (and therefore emphasizing the functional importance of conserved residues). For example, "evolutionary trace" and related methods divide multiply-aligned sequences into subfamilies, typically by phylogeny, comparing patterns of conservation among evolutionarily-related subfamilies and often mapping onto 3D structure [13, 36–42]. Basically, these methods posit that positions that are conserved in all sequences are important for structure, and positions that are conserved within subfamilies (but vary among the sub-types) are important for function (i.e., the function of the proteins in the subfamily).

Here we show how analysis of sequence variability can be enlarged to understand functional variability in whole families of proteins with similar structures. If one collects proteins with the same structure and diverse functions, then structural positions will be conserved and functional positions will vary, and the degree of variation will be related to the degree of variation among the ligands or substrates. In the case of repeat motifs such as TPRs and ankyrins, where structural elements are further divorced from functional conservation by ignoring how the motifs are arranged in domains, the degree of variability is sometimes so profound that it alone can be used to predict the binding site. When the ligands have commonalities, then it becomes more difficult to predict the binding site from variation alone. In that case, as with PDZ domains and Zifs, variation in combination with an example structure still reveals specificity-determining binding positions, which is critical information for re-engineering specificity. (The corollary to this argument is that the pattern of variation of known binding residues will suggest the pattern of variation in the ligands.)

Analysis of overall variance among structurally-related families provides complementary information to methods that analyze variance among evolutionarily-related subfamilies, which have proven very powerful in recent years. A major challenge for these evolutionary trace methods is accurate functional sub-typing, particularly when family members have diverged very significantly. Our method avoids functional sub-typing and, rather, benefits from increased functional divergence of family members (since it results in increased variation among functional positions). Further attention to variable residues in families overall therefore stands to improve exiting methods of functional prediction.

There are hundreds of binding scaffolds with sufficient examples known to permit this type of statistical analysis. The use of rigorous measures of how amino acid distributions differ improves significantly upon conservation alone as a means of identifying important residues within a protein family (this has been reviewed recently [23]). The rapid identification of specificity-determining positions will be useful for the design of proteins with altered binding specificity. The predictions of specificity-determining residues in Ank repeat proteins, Zifs and PDZ domains agree strikingly well with results from considerable structural and biochemical work, and therefore provide a guide for re-engineering binding specificity by design even for protein families lacking extensive characterization. Moreover, knowledge of the specificity-determining residues can be incorporated into evolutionary trace methods to develop a comprehensive view of residues critical for function.

Methods

Sequences

Aligned sequences of TPRs, Ank repeats, C₂H₂ Zifs and PDZ domains were downloaded from Pfam http://pfam.wustl.edu[18]. TPRs of non-canonical length (i.e., not 34 amino acids) were discarded, and only canonical positions were considered with Ank repeats, Zifs and PDZ domains (i.e., ignoring low-occupancy positions from insertions and deletions). All calculations were carried out in Microsoft Excel 2003 on a Dell Latitude C640 with a 2.2 GHz Intel Mobile Pentium 4 processor. Factorials were computed from the Stirling approximation.

Statistical Free Energies

The SFEs associated with each amino acid were determined from application of the Boltzmann law [1], where k is the Boltzmann constant, kT* is an arbitrary energy unit (since the "temperature" of the ensemble T* is not necessarily related to T for conventional systems), and P_refis the probability associated with a hypothetical site with amino acid usage as in the reference state. The binomial probability P_xof the observation of n_xsequences with amino acid x was calculated from [3], where f_xis the frequency of the amino acid in a reference state, all ORFs in Saccharomyces cerevisiae. The SFEs associated with the observed frequencies of the 20 amino acids at each site can be thought of as elements of a 20-dimensional vector. The scalar length of this vector, the root-mean-square average for all amino acids [2], is therefore taken to be the statistical free energy, ΔG_stat, that separates the observed positional amino acid distribution from the reference state. For comparison to Lockless & Ranganathan [14], the ΔG_statvalues were arbitrarily divided by 100. However, the Ranganathan group normalizes the number of sequences to 100, and we have shown that ΔG_statis proportional to N for large N. Therefore, the ΔG_statvalues we calculate are 10-fold larger than those calculated by the Ranganathan method for the same amino acid distribution, since we have normalized to N = 1000.

Other metrics of distribution difference

Multinomial probabilities [4], sequence (Shannon) entropies [5] and relative entropies [6] were calculated as described above. For sequence and relative entropy calculations, the frequencies were calculated as:

$p_{x} = \frac{n_{x} + 1}{N + 1} [7]$

so that ln p_xwas always defined (and is valid as long as N is large). The values used for f_xare listed in our previous paper [16].

Abbreviations

Ank:: ankyrin
HOP:: Hsp-organizing protein
MBP:: maltose binding protein
SFE:: statistical free energy
TPR:: tetratricopeptide repeat protein
Zif:: zinc finger

References

Teichmann SA, Murzin AG, Chothia C: Determination of protein function, evolution and interactions by structural genomics. Current Opinion in Structural Biology 2001, 11: 354–363. 10.1016/S0959-440X(00)00215-3
Article CAS PubMed Google Scholar
Thornton JM: From genome to function. Science 2001, 292: 2095–2097. 10.1126/science.292.5524.2095
Article CAS PubMed Google Scholar
Gough J, Karplus K, Hughey R, Chothia C: Assignment of homology to genome sequences using a library of hidden Markov models that represent all proteins of known structure. Journal of Molecular Biology 2001, 313: 903–919. 10.1006/jmbi.2001.5080
Article CAS PubMed Google Scholar
Bork P, Koonin EV: Protein sequence motifs. Current Opinion in Structural Biology 1996, 6: 366–376. 10.1016/S0959-440X(96)80057-1
Article CAS PubMed Google Scholar
Whisstock JC, Lesk AM: Prediction of protein function from protein sequence and structure. Quarterly Review of Biophysics 2003, 36: 307–340. 10.1017/S0033583503003901
Article CAS Google Scholar
Yao H, Kristensen DM, Mihalek I, Sowa ME, Shaw C, Kimmel M, Kavraki L, Lichtarge O: An accurate, sensitive, and scalable method to identify functional sites in protein structures. Journal of Molecular Biology 2003, 326: 255–261. 10.1016/S0022-2836(02)01336-0
Article CAS PubMed Google Scholar
Ma B, Elkayam T, Wolfson H, Nussinov R: Protein-protein interactions: structurally conserved residues distinguish between binding sites and exposed protein surfaces. Proceedings of the National Academy of Sciences of the United States of America 2003, 100: 5772–5777. 10.1073/pnas.1030237100
Article PubMed Central CAS PubMed Google Scholar
Jones S, Thornton JM: Searching for functional sites in protein structures. Current Opinion in Chemical Biology 2004, 8: 3–7. 10.1016/j.cbpa.2003.11.001
Article CAS PubMed Google Scholar
Mihalek I, Res I, Lichtarge O: A family of evolution-entropy hybrid methods for ranking protein residues by importance. Journal of Molecular Biology 2004, 336: 1265–1282. 10.1016/j.jmb.2003.12.078
Article CAS PubMed Google Scholar
Chothia C, Lesk AM: The relation between the divergence of sequence and structure in proteins. EMBO Journal 1986, 5: 823–826.
PubMed Central CAS PubMed Google Scholar
Bashford D, Chothia C, Lesk AM: Determinants of a protein fold. Unique features of the globin amino acid sequences. Journal of Molecular Biology 1987, 196: 199–216. 10.1016/0022-2836(87)90521-3
Article CAS PubMed Google Scholar
Bowie JU, Reidhaar-Olson JF, Lim WA, Sauer RT: Deciphering the message in protein sequences: tolerance to amino acid substitutions. Science 1990, 247: 1306–1310.
Article CAS PubMed Google Scholar
Lichtarge O, Bourne HR, Cohen FE: An evolutionary trace method defines binding surfaces common to protein families. Journal of Molecular Biology 1996, 257: 342–358. 10.1006/jmbi.1996.0167
Article CAS PubMed Google Scholar
Lockless SW, Ranganathan R: Evolutionarily conserved pathways of energetic connectivity in protein families. Science 1999, 286: 295–299. 10.1126/science.286.5438.295
Article CAS PubMed Google Scholar
Shortle D: Propensities, probabilities and the Boltzmann hypothesis. Protein Science 2003, 12: 1298–1302. 10.1110/ps.0306903
Article PubMed Central CAS PubMed Google Scholar
Magliery TJ, Regan L: Beyond consensus: statistical free energies reveal hidden interactions in the design of a TPR motif. J Mol Biol 2004, 343: 731–745. 10.1016/j.jmb.2004.08.026
Article CAS PubMed Google Scholar
D'Andrea LD, Regan L: TPR proteins: the versatile helix. Trends in Biochemical Sciences 2003, 28: 655–662. 10.1016/j.tibs.2003.10.007
Article PubMed Google Scholar
Sonnhammer EL, Eddy SR, Durbin R: Pfam: a comprehensive database of protein domain families based on seed alignments. Proteins 1997, 28: 405–420. 10.1002/(SICI)1097-0134(199707)28:3<405::AID-PROT10>3.0.CO;2-L
Article CAS PubMed Google Scholar
Scheufler C, Brinker A, Bourenkov G, Pegoraro S, Moroder L, Bartunik H, Hartl FU, Moarefi I: Structure of TPR domain-peptide complexes: critical elements in the assembly of the Hsp70-Hsp90 multichaperone machine. Cell 2000, 101: 199–210. 10.1016/S0092-8674(00)80830-2
Article CAS PubMed Google Scholar
Gatto GJJ, Geisbrecht BV, Gould SJ, Berg JM: Peroxisomal targeting signal-1 recognition by the TPR domains of human PEX5. Nature Structural Biology 2000, 7: 1091–1095. 10.1038/81930
Article CAS PubMed Google Scholar
Kanehisa M, Goto S, Kawashima S, Nakaya A: The KEGG databases at GenomeNet. Nucleic Acids Research 2002, 30: 42–46. 10.1093/nar/30.1.42
Article PubMed Central CAS PubMed Google Scholar
Dekker JP, Fodor A, Aldrich RW, Yellen G: A perturbation-based method for calculating explicit likelihood of evolutionary co-variance in multiple sequence alignments. Bioinformatics 2004, 20: 1565–1572. 10.1093/bioinformatics/bth128
Article CAS PubMed Google Scholar
Valdar WS: Scoring residue conservation. Proteins 2002, 48: 227–241. 10.1002/prot.10146
Article CAS PubMed Google Scholar
Villar HO, Kauvar LM: Amino acid preferences at protein binding sites. FEBS Lett 1994, 349: 125–130. 10.1016/0014-5793(94)00648-2
Article CAS PubMed Google Scholar
Sedgwick SG, Smerdon SJ: The ankyrin repeat: a diversity of interactions on a common structural framework. Trends in Biochemical Sciences 1999, 24: 311–316. 10.1016/S0968-0004(99)01426-7
Article CAS PubMed Google Scholar
Batchelor AH, Piper DE, de la Brousse FC, McKnight SL, Wolberger C: The structure of GABPalpha/beta: an ETS domain- ankyrin repeat heterodimer bound to DNA. Science 1998, 279: 1037–1041. 10.1126/science.279.5353.1037
Article CAS PubMed Google Scholar
Binz HK, Amstutz P, Kohl A, Stumpp MT, CBriand C, Forrer P, Grutter MG, Pluckthun A: High-affinity binders selected from designed ankyrin repeat protein libraries. Nature Biotechnology 2004, 22: 575–582. 10.1038/nbt962
Article CAS PubMed Google Scholar
Binz HK, Stumpp MT, Forrer P, Amstutz P, Pluckthun A: Designing repeat proteins: well-expressed, soluble and stable proteins from combinatorial libraries of consensus ankyrin repeat proteins. Journal of Molecular Biology 2003, 332: 489–503. 10.1016/S0022-2836(03)00896-9
Article CAS PubMed Google Scholar
Wolfe SA, Nekludova L, Pabo CO: DNA recognition by Cys(2)His(2) zinc finger proteins. Annual Review of Biophysics & Biomolecular Structure 2000, 29: 183–212. 10.1146/annurev.biophys.29.1.183
Article CAS Google Scholar
Pabo CO, Peisach E, Grant RA: Design and selection of novel Cys(2)His(2) zinc finger proteins. Annual Review of Biochemistry 2001, 70: 313–340. 10.1146/annurev.biochem.70.1.313
Article CAS PubMed Google Scholar
Songyang Z, Fanning AS, Fu C, Xu J, Marfatia SM, Chishti AH, Crompton A, Chan AC, Anderson JM, Cantley LC: Recognition of unique carboxyl-terminal motifs by distinct PDZ domains. Science 1997, 275: 73–77. 10.1126/science.275.5296.73
Article CAS PubMed Google Scholar
Wiedemann U, Boisguerin P, Leben R, Leitner D, Krause G, Moelling K, Volkmer-Engert R, Oschkinat H: Quantification of PDZ domain specificity, prediction of ligand affinity and rational design of super-binding peptides. J Mol Biol 2004, 343: 703–718. 10.1016/j.jmb.2004.08.064
Article CAS PubMed Google Scholar
Guex N, Peitsch MC: SWISS-MODEL and the Swiss-PdbViewer: An environment for comparative protein modeling. Electrophoresis 1997, 18: 2714–2723. 10.1002/elps.1150181505
Article CAS PubMed Google Scholar
Reina J, Lacroix E, Hobson SD, Fernandez-Ballester G, Rybin V, Schwab MS, Serrano L, Gonzalez C: Computer-aided design of a PDZ domain to recognize new target sequences. Nat Struct Biol 2002, 9: 621–627.
CAS PubMed Google Scholar
Hilschmann N, Craig LC: Amino acid sequence with Bence-Jones proteins. Proceedings of the National Academy of Sciences of the United States of America 1965, 53: 1403–1409.
Article PubMed Central CAS PubMed Google Scholar
Sali A, Overington JP, Johnson MS, Blundell TL: From comparisons of protein sequences and structures to protein modelling and design. Trends Biochem Sci 1990, 15: 235–240. 10.1016/0968-0004(90)90036-B
Article CAS PubMed Google Scholar
Hannenhalli SS, Russell RB: Analysis and prediction of functional sub-types from protein sequence alignments. J Mol Biol 2000, 303: 61–76. 10.1006/jmbi.2000.4036
Article CAS PubMed Google Scholar
Innis CA, Shi J, Blundell TL: Evolutionary trace analysis of TGF-beta and related growth factors: implications for site-directed mutagenesis. Protein Eng 2000, 13: 839–847. 10.1093/protein/13.12.839
Article CAS PubMed Google Scholar
Landgraf R, Xenarios I, Eisenberg D: Three-dimensional cluster analysis identifies interfaces and functional residue clusters in proteins. J Mol Biol 2001, 307: 1487–1502. 10.1006/jmbi.2001.4540
Article CAS PubMed Google Scholar
del Sol Mesa A, Pazos F, Valencia A: Automatic methods for predicting functionally important residues. J Mol Biol 2003, 326: 1289–1302. 10.1016/S0022-2836(02)01451-1
Article CAS PubMed Google Scholar
Folkertsma S, van Noort P, Van Durme J, Joosten HJ, Bettler E, Fleuren W, Oliveira L, Horn F, de Vlieg J, Vriend G: A family-based approach reveals the function of residues in the nuclear receptor ligand-binding domain. J Mol Biol 2004, 341: 321–335. 10.1016/j.jmb.2004.05.075
Article CAS PubMed Google Scholar
Berezin C, Glaser F, Rosenberg J, Paz I, Pupko T, Fariselli P, Casadio R, Ben-Tal N: ConSeq: the identification of functionally and structurally important residues in protein sequences. Bioinformatics 2004, 20: 1322–1324. 10.1093/bioinformatics/bth070
Article CAS PubMed Google Scholar
Elrod-Erickson M, Benson TE, Pabo CO: High-resolution structures of variant Zif268-DNA complexes: implications for understanding zinc finger-DNA recognition. Structure 1998, 6: 451–464. 10.1016/S0969-2126(98)00047-1
Article CAS PubMed Google Scholar
Doyle DA, Lee A, Lewis J, Kim E, Sheng M, MacKinnon R: Crystal structures of a complexed and peptide-free membrane protein-binding domain: molecular basis of peptide recognition by PDZ. Cell 1996, 85: 1067–1076. 10.1016/S0092-8674(00)81307-0
Article CAS PubMed Google Scholar

Download references

Acknowledgements

We thank A. López Cortajarena, T. Kajander, S. Mochrie, J. Venkatraman and C.G.M. Wilson (Yale) and S.S. Licht (MIT) for critical reading of this manuscript and insightful comments. Special thanks to M. Gerstein (Yale) for helpful suggestions on calculating distribution differences. T.J.M. was an N.I.H. Postdoctoral Fellow (GM065750). This work was supported in part by N.I.H. grants GM49146 and GM62413 (L.R.).

Author information

Authors and Affiliations

Department of Molecular Biophysics & Biochemistry, Yale University, P.O. Box 208114, New Haven, CT, 06520-8114, USA
Thomas J Magliery & Lynne Regan
Department of Chemistry and Department of Biochemistry, The Ohio State University, 100 W. 18th Ave., Columbus, OH, 43210, USA
Thomas J Magliery
Department of Chemistry, Yale University, New Haven, CT, USA
Lynne Regan

Authors

Thomas J Magliery
View author publications
You can also search for this author in PubMed Google Scholar
Lynne Regan
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding authors

Correspondence to Thomas J Magliery or Lynne Regan.

Additional information

Authors' contributions

T.J.M. performed all calculations and analysis, and drafted the manuscript. L.R. aided in interpretation of the data and manuscript preparation, and provided support.

Electronic supplementary material

Additional File 1: The relationship between multinomial probability and relative entropy is derived. (PDF 21 KB)

Authors’ original submitted files for images

Below are the links to the authors’ original submitted files for images.

Authors’ original file for figure 1

Authors’ original file for figure 2

Authors’ original file for figure 3

Authors’ original file for figure 4

Authors’ original file for figure 5

Authors’ original file for figure 6

Rights and permissions

Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article is distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and permissions

About this article

Cite this article

Magliery, T.J., Regan, L. Sequence variation in ligand binding sites in proteins. BMC Bioinformatics 6, 240 (2005). https://doi.org/10.1186/1471-2105-6-240

Download citation

Received: 15 March 2005
Accepted: 30 September 2005
Published: 30 September 2005
DOI: https://doi.org/10.1186/1471-2105-6-240

Sequence variation in ligand binding sites in proteins

Abstract

Background

Results

Conclusion

Background

Results & discussion

The TPR binding site

Measuring sequence variation

Effects of sample size

Ankyrin repeats: comparison with experiment

Dissecting binding sites when ligands have conserved features

The meaning of relative entropy values

Conclusion

Methods

Sequences

Statistical Free Energies

Other metrics of distribution difference

Abbreviations

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding authors

Additional information

Authors' contributions

Electronic supplementary material

Additional File 1: The relationship between multinomial probability and relative entropy is derived. (PDF 21 KB)

Authors’ original submitted files for images

Authors’ original file for figure 1

Authors’ original file for figure 2

Authors’ original file for figure 3

Authors’ original file for figure 4

Authors’ original file for figure 5

Authors’ original file for figure 6

Rights and permissions

About this article

Cite this article

Share this article

Keywords

BMC Bioinformatics

Contact us