Incorporating background frequency improves entropy-based residue conservation measures
© Wang and Samudrala; licensee BioMed Central Ltd. 2006
Received: 14 June 2006
Accepted: 17 August 2006
Published: 17 August 2006
Several entropy-based methods have been developed for scoring sequence conservation in protein multiple sequence alignments. High scoring amino acid positions may correlate with structurally or functionally important residues. However, amino acid background frequencies are usually not taken into account in these entropy-based scoring schemes.
We demonstrate that using a relative entropy measure that incorporates amino acid background frequency results in improved performance in identifying functional sites from protein multiple sequence alignments.
Our results suggest that the application of appropriate background frequency information may lead to more biologically relevant results in many areas of bioinformatics.
Protein multiple sequence alignments are widely used to infer conservation of amino acid residues within an evolutionarily related family [1, 2]. Highly conserved residues tend to correlate with structural and/or functional importance, and accurate identification of such important residues aids in experimental characterization of protein function.
One commonly used sequence conservation measure is the entropy score. In its simplest form, the entropy score for each aligned column in a multiple sequence alignments can be expressed as:
where n aa is the number of residue types in the column representing an alignment position, and p i represents the observed frequency of residue type i in the aligned column. Other, more complicated, residue conservation measures derived from the entropy score have been developed and used in identifying functionally important residues [3–8].
In recent years many sophisticated functional site prediction algorithms have been developed (for reviews, see [9, 10]). Many of these prediction algorithms implicitly or explicitly analyze the amino acid variations in a given position in multiple alignments. The evolutionary trace method analyzes residue variation patterns within and between protein subfamilies from multiple alignments, and maps important residues to protein structure [11, 12]. A further development of this method incorporates entropy information for more accurate ranking of residue importance . Oliveira et al devised entropy-variability plots that can be used to identify structurally and functionally important residues . Pei et al used the conservation difference between artificial sequence profiles and naturally occurring sequence profiles to detect homology and identify active sites . Soyer et al used site-specific evolutionary models for predicting functional sites in proteins . Wang et al used linear models to analyze multiple alignments for mutated viral proteins to identify amino acid positions important for drug resistance . Chelliah et al identified interaction sites by separating the structural and functional constraints for each position in multiple alignments . Greaves et al used geometry-based and sequence profile-based calculation for predicting enzyme active sites . Cheng et al introduced a hybrid method incorporating both sequence conservation and structural stability to predict functional sites . The SIFT server automatically constructs multiple alignments for query sequence and then predicts amino acid substitutions that are likely to affect protein function . The ConSurf server identifies protein functional region by surface mapping of phylogenetic information inferred from multiple alignments . The MINER server uses phylogenetic motifs from multiple alignments to identify protein functional site regions . Besides functional site identification, the residue conservation information can also be used to identify residues determining subfamiliy functional specificity [24–28]. The prevalence of these methods suggests that the extraction of conservation information from multiple alignments is important for the correct prediction of functional residues or regions.
Results and discussion
Here, we argue that entropy scores that do not incorporate background amino acid frequencies are not theoretically optimal for calculating residue conservation. To demonstrate this, we rewrite the entropy score using a uniform amino acid frequency distribution P u (p u = 1/n aa for each residue type in the aligned column):
where D(P i || P u ) is commonly referred to as the relative entropy or Kullback-Leibler divergence. Therefore, the entropy score is numerically identical to the negative relative entropy between the observed amino acid frequency distribution P i and a uniform distribution P u , minus a constant. In statistics, relative entropy arises as the expected logarithm of the likelihood ratio, and it may be used as a measure of the distance between two probability distributions . In the case of residue conservation, the higher deviation from the "background" indicates stronger evolutionary constraint, which suggests that this position may perform an important functional role. However, nature does not sample every amino acid equally when creating proteins. Therefore, the simple uniform distribution P u above is not optimal as a reference distribution to evaluate functional importance.
We propose that a relative entropy measure incorporating the observed background frequency from protein sequence databases would be a better measure to capture the functional importance of amino acid residues. More specifically, we propose to use the formula:
where P ib can represent the background amino acid frequencies found in naturally occurring protein sequences, or any other arbitrary set of background frequencies. This measure will increase the scores for aligned columns containing "rare" residues, which are often functionally important. To explain this in a more intuitive way, consider two invariant positions that have only cysteines and only serines, respectively. The position with cysteines is more likely to be functional. The entropy measure will assign the same score to the two positions, but the relative entropy measure assigns a higher score to the invariant cysteine position, since cysteine has a much lower background frequency (~2%) than serine (~7%).
Comparative analysis of entropy score and relative entropy score
To investigate whether our proposed relative entropy measure (S relative_entropy ) is more sensitive than the entropy measure (S entropy ) in detecting functional sites, we evaluated the performance of these measures to identify functionally important residues using the Thornton  and Lovell datasets . In addition, to investigate how the use of different background frequencies affects the performance of the relative entropy method, we used two sets of frequencies: the general background frequencies observed in nature and the family-specific background frequencies retrieved from the alignments for each query sequence.
Comparative analysis of entropy scores using more accurate frequency estimates
We further investigated whether entropy-based methods could benefit from incorporating more accurate amino acid frequency information. We compiled a HMM model from multiple alignments for each query sequence, and calculated the positional entropy or relative entropy (using the general background frequencies) for each aligned column, similar to a previous study . There are two main advantages of using HMM models: (1) sequences are weighted so that the effects of uneven or biased database sampling are reduced and (2) frequencies of unobserved amino acids can be estimated through the use of Dirichlet mixtures .
Improvements for functional site prediction can occur by increasing the sophistication of the measures considered: The relative entropy method only scores individual positions without considering neighboring residues. A method that analyzes context (neighboring residues) may improve performance. The multiple alignments generated by the PSI-BLAST program may not be optimal, and more accurate multiple alignments (such as those generated by the HMM method) may improve performance. In addition, an appropriate treatment of gaps and sequence weights (for example, a family specific treatment), consideration of phylogenetic relationships among sequences, and analysis of local structural information when available, is also likely to improve performance. We believe that a hybrid method that incorporates the relative entropy method and all the above improvements will have significantly better performance for functional site prediction.
In conclusion, the use of background frequency information significantly improves entropy-based functional site prediction. This principle has been advocated before (such as in ), but its use has been very limited: For example, sequence logo is widely used to visually display conserved nucleotide or amino acid sites in sequence motifs; however, many logo generation programs [35–38] are unable to accept user-supplied background frequencies. (Some exceptions include the PICTOGRAM server  and the CONSENSUS server .) In addition, several programs for editing or annotating multiple alignments exist [41, 42], but they are unable to use background frequencies to calculate relative entropy for each aligned residue.
The use of background information is limited in other areas of bioinformatics as well. For example, many programs are available to identify "functional enrichment" for a list of genes from microarray experiment, but only a few of them are able to accept a user-supplied list of "background genes". Dozens of tools are available to identify transcription factor binding sites (TFBS) or other functional motifs in a given sequence, but few of them are able to take into account the background frequency of predicted TFBS or motifs in the corresponding genome (exceptions include ). Fold recognition methods are widely used to assign a query sequence to a structural fold, but few considers the relative abundance (or prior probability) of candidate folds in the corresponding proteome. The broader application of appropriate background information in all areas of bioinformatics will lead to more biologically relevant results.
We used two datasets consisting of protein functional sites for evaluating different algorithms. The Thornton dataset was compiled manually from primary literature on known protein structures and was shown to be more comprehensive and specific than the SITE annotation in PDB files . The Lovell dataset contains manually compiled protein functional sites, including ligand binding sites and enzyme active sites . The Thornton dataset contains 1,546 enzyme active sites from 508 proteins, and the Lovell dataset contains 1,137 functional sites from 243 proteins. The same versions of the two sets have been used previously for structure-based searching of functional sites .
Criteria for performance evaluation
We used two criteria for evaluating the performance of functional site prediction algorithms. The first criterion is the ROC score, which computes the area under a curve that plots fraction of true positives versus false positives by varying the threshold value of classification. A perfect classification algorithm that puts all the functional sites at the top of the ranked residue list has an ROC score of 1, and a random classification algorithm has an ROC score of 0.5. The second criterion is the "top 10 hits" score, which computes the number of functional sites among the top 10 scoring residues in a given protein. A perfect classification algorithm has a top 10 hits score of N fun or 10 (if N fun is more than 10), while a random classification algorithm has an average top 10 hits score of 10*N fun /N (N fun and N denote the number of functional residues and the number of all residues in the given protein, respectively).
Functional site identification
We evaluated several functional site identification methods that take protein multiple sequence alignments as input for their predictions. For the Thornton and Lovell datasets, we generated multiple sequence alignments by searching each query sequence against the Uniref90 database  with the PSI-BLAST program blastpgp in the BLAST program package . We used three iterations (through the "-j 3" option), the "-m 6" display option and all other default parameters for the PSI-BLAST program. The resulting multiple sequence alignments were converted to ClustalW format, and then analyzed by various scoring methods described below.
For the entropy and relative entropy method, we used equation (1) and (3) in the main text to assign a score to each aligned column in the multiple alignments. We treated gaps in the same way as an amino acid, though we found little performance difference when ignoring gaps. Only the residue types that appear in aligned columns were used in the computation of the relative entropy score. The background distribution P ib in equation (3) can be varied to explore the use of different background frequencies. In our benchmarking experiment, we used two different sets of background frequencies: (1) the general background frequencies defined in karlin.c program of the BLAST package , and (2) the family-specific background frequencies observed in the multiple alignments for each query sequence.
For the HMM-derived entropy and relative entropy method, we first compiled a HMM model using the hmmbuild program in the HMMER package  with default parameters, and then calculated the positional entropy or relative entropy using amino acid frequencies estimated by the HMM model. The general background frequencies are used for the relative entropy computation.
We also used three conservation measures implemented in the AL2CO program , including the entropy measure (AL2CO_entropy), the variance measure (AL2CO_variance) and the sum-of-pairs measure (AL2CO_sop). All the three methods use the default "independent count" scheme for weighting sequences . The default parameters for the AL2CO program were used for computation, except that the BLOSUM62 matrix rather than identity matrix was used in the sum-of-pairs method. For the SCORECONS method , we used the scorecons program with default parameters.
This work was supported by a Searle Scholar Award, a NSF CAREER award, NSF grant DBI-0217241, and NIH grant GM068152-01. We wish to thank members of the Samudrala group and Dr. Sridhar Hannenhalli for helpful discussions and comments.
- Valdar WS: Scoring residue conservation. Proteins 2002, 48(2):227–241. 10.1002/prot.10146View ArticlePubMed
- Pei J, Grishin NV: AL2CO: calculation of positional conservation in a protein sequence alignment. Bioinformatics 2001, 17(8):700–712. 10.1093/bioinformatics/17.8.700View ArticlePubMed
- Sander C, Schneider R: Database of homology-derived protein structures and the structural meaning of sequence alignment. Proteins 1991, 9(1):56–68. 10.1002/prot.340090107View ArticlePubMed
- Shenkin PS, Erman B, Mastrandrea LD: Information-theoretical entropy as a measure of sequence variability. Proteins 1991, 11(4):297–313. 10.1002/prot.340110408View ArticlePubMed
- Gerstein M, Altman RB: Average core structures and variability measures for protein families: application to the immunoglobulins. J Mol Biol 1995, 251(1):161–175. 10.1006/jmbi.1995.0423View ArticlePubMed
- Williamson RM: Information theory analysis of the relationship between primary sequence structure and ligand recognition among a class of facilitated transporters. J Theor Biol 1995, 174(2):179–188. 10.1006/jtbi.1995.0090View ArticlePubMed
- Mirny LA, Shakhnovich EI: Universally conserved positions in protein folds: reading evolutionary signals about stability, folding kinetics and function. J Mol Biol 1999, 291(1):177–196. 10.1006/jmbi.1999.2911View ArticlePubMed
- Plaxco KW, Larson S, Ruczinski I, Riddle DS, Thayer EC, Buchwitz B, Davidson AR, Baker D: Evolutionary conservation in protein folding kinetics. J Mol Biol 2000, 298(2):303–312. 10.1006/jmbi.1999.3663View ArticlePubMed
- Jones S, Thornton JM: Searching for functional sites in protein structures. Curr Opin Chem Biol 2004, 8(1):3–7. 10.1016/j.cbpa.2003.11.001View ArticlePubMed
- Watson JD, Laskowski RA, Thornton JM: Predicting protein function from sequence and structural data. Curr Opin Struct Biol 2005, 15(3):275–284. 10.1016/j.sbi.2005.04.003View ArticlePubMed
- Lichtarge O, Bourne HR, Cohen FE: An evolutionary trace method defines binding surfaces common to protein families. J Mol Biol 1996, 257(2):342–358. 10.1006/jmbi.1996.0167View ArticlePubMed
- Yao H, Kristensen DM, Mihalek I, Sowa ME, Shaw C, Kimmel M, Kavraki L, Lichtarge O: An accurate, sensitive, and scalable method to identify functional sites in protein structures. J Mol Biol 2003, 326(1):255–261. 10.1016/S0022-2836(02)01336-0View ArticlePubMed
- Mihalek I, Res I, Lichtarge O: A family of evolution-entropy hybrid methods for ranking protein residues by importance. J Mol Biol 2004, 336(5):1265–1282. 10.1016/j.jmb.2003.12.078View ArticlePubMed
- Oliveira L, Paiva PB, Paiva AC, Vriend G: Identification of functionally conserved residues with the use of entropy-variability plots. Proteins 2003, 52(4):544–552. 10.1002/prot.10490View ArticlePubMed
- Pei J, Dokholyan NV, Shakhnovich EI, Grishin NV: Using protein design for homology detection and active site searches. Proc Natl Acad Sci U S A 2003, 100(20):11361–11366. 10.1073/pnas.2034878100PubMed CentralView ArticlePubMed
- Soyer OS, Goldstein RA: Predicting functional sites in proteins: site-specific evolutionary models and their application to neurotransmitter transporters. J Mol Biol 2004, 339(1):227–242. 10.1016/j.jmb.2004.03.025View ArticlePubMed
- Wang K, Jenwitheesuk E, Samudrala R, Mittler JE: Simple linear model provides highly accurate genotypic predictions of HIV-1 drug resistance. Antivir Ther 2004, 9(3):343–352.PubMed
- Chelliah V, Chen L, Blundell TL, Lovell SC: Distinguishing structural and functional restraints in evolution in order to identify interaction sites. J Mol Biol 2004, 342(5):1487–1504. 10.1016/j.jmb.2004.08.022View ArticlePubMed
- Greaves R, Warwicker J: Active site identification through geometry-based and sequence profile-based calculations: burial of catalytic clefts. J Mol Biol 2005, 349(3):547–557. 10.1016/j.jmb.2005.04.018View ArticlePubMed
- Cheng G, Qian B, Samudrala R, Baker D: Improvement in protein functional site prediction by distinguishing structural and functional constraints on protein family evolution using computational design. Nucleic Acids Res 2005, 33(18):5861–5867. 10.1093/nar/gki894PubMed CentralView ArticlePubMed
- Ng PC, Henikoff S: SIFT: Predicting amino acid changes that affect protein function. Nucleic Acids Res 2003, 31(13):3812–3814. 10.1093/nar/gkg509PubMed CentralView ArticlePubMed
- Landau M, Mayrose I, Rosenberg Y, Glaser F, Martz E, Pupko T, Ben-Tal N: ConSurf 2005: the projection of evolutionary conservation scores of residues on protein structures. Nucleic Acids Res 2005, 33(Web Server issue):W299–302. 10.1093/nar/gki370PubMed CentralView ArticlePubMed
- La D, Livesay DR: Predicting functional sites with an automated algorithm suitable for heterogeneous datasets. BMC Bioinformatics 2005, 6: 116. 10.1186/1471-2105-6-116PubMed CentralView ArticlePubMed
- Hannenhalli SS, Russell RB: Analysis and prediction of functional sub-types from protein sequence alignments. J Mol Biol 2000, 303(1):61–76. 10.1006/jmbi.2000.4036View ArticlePubMed
- Vilim RB, Cunningham RM, Lu B, Kheradpour P, Stevens FJ: Fold-specific substitution matrices for protein classification. Bioinformatics 2004, 20(6):847–853. 10.1093/bioinformatics/btg492View ArticlePubMed
- Bielawski JP, Yang Z: A maximum likelihood method for detecting functional divergence at individual codon sites, with application to gene family evolution. J Mol Evol 2004, 59(1):121–132. 10.1007/s00239-004-2597-8View ArticlePubMed
- Pei J, Cai W, Kinch LN, Grishin NV: Prediction of functional specificity determinants from protein sequences using log-likelihood ratios. Bioinformatics 2006, 22(2):164–171. 10.1093/bioinformatics/bti766View ArticlePubMed
- Mirny LA, Gelfand MS: Using orthologous and paralogous proteins to identify specificity-determining residues in bacterial transcription factors. J Mol Biol 2002, 321(1):7–20. 10.1016/S0022-2836(02)00587-9View ArticlePubMed
- Cover TM, Thomas JA: Elements of information theory. In Wiley series in telecommunications. Edited by: Schilling DL. New York, John Wiley & Sons; 1991.
- Porter CT, Bartlett GJ, Thornton JM: The Catalytic Site Atlas: a resource of catalytic sites and residues identified in enzymes using structural data. Nucleic Acids Res 2004, 32 Database issue: D129–33. 10.1093/nar/gkh028View Article
- Sjolander K, Karplus K, Brown M, Hughey R, Krogh A, Mian IS, Haussler D: Dirichlet mixtures: a method for improved detection of weak but significant protein sequence homology. Comput Appl Biosci 1996, 12(4):327–345.PubMed
- Valdar WS, Thornton JM: Protein-protein interfaces: analysis of amino acid conservation in homodimers. Proteins 2001, 42(1):108–124. 10.1002/1097-0134(20010101)42:1<108::AID-PROT110>3.0.CO;2-OView ArticlePubMed
- Sunyaev SR, Eisenhaber F, Rodchenkov IV, Eisenhaber B, Tumanyan VG, Kuznetsov EN: PSIC: profile extraction from sequence alignments with position-specific counts of independent observations. Protein Eng 1999, 12(5):387–394. 10.1093/protein/12.5.387View ArticlePubMed
- Stormo GD: Information content and free energy in DNA--protein interactions. J Theor Biol 1998, 195(1):135–137. 10.1006/jtbi.1998.0785View ArticlePubMed
- Schuster-Bockler B, Schultz J, Rahmann S: HMM Logos for visualization of protein families. BMC Bioinformatics 2004, 5: 7. 10.1186/1471-2105-5-7PubMed CentralView ArticlePubMed
- Crooks GE, Hon G, Chandonia JM, Brenner SE: WebLogo: a sequence logo generator. Genome Res 2004, 14(6):1188–1190. 10.1101/gr.849004PubMed CentralView ArticlePubMed
- Bindewald E, Schneider TD, Shapiro BA: CorreLogo: an online server for 3D sequence logos of RNA and DNA alignments. Nucleic Acids Res 2006, 34(Web Server issue):W405–11.PubMed CentralView ArticlePubMed
- Workman CT, Yin Y, Corcoran DL, Ideker T, Stormo GD, Benos PV: enoLOGOS: a versatile web tool for energy normalized sequence logos. Nucleic Acids Res 2005, 33(Web Server issue):W389–92. 10.1093/nar/gki439PubMed CentralView ArticlePubMed
- PICTOGRAM: [http://genes.mit.edu/pictogram.html].
- CONSENSUS: [http://adric.wustl.edu/oldconsensus].
- Clamp M, Cuff J, Searle SM, Barton GJ: The Jalview Java alignment editor. Bioinformatics 2004, 20(3):426–427. 10.1093/bioinformatics/btg430View ArticlePubMed
- Johnson JM, Mason K, Moallemi C, Xi H, Somaroo S, Huang ES: Protein family annotation in a multiple alignment viewer. Bioinformatics 2003, 19(4):544–545. 10.1093/bioinformatics/btg021View ArticlePubMed
- Levy S, Hannenhalli S: Identification of transcription factor binding sites in the human genome sequence. Mamm Genome 2002, 13(9):510–514. 10.1007/s00335-002-2175-6View ArticlePubMed
- Wu CH, Apweiler R, Bairoch A, Natale DA, Barker WC, Boeckmann B, Ferro S, Gasteiger E, Huang H, Lopez R, Magrane M, Martin MJ, Mazumder R, O'Donovan C, Redaschi N, Suzek B: The Universal Protein Resource (UniProt): an expanding universe of protein information. Nucleic Acids Res 2006, 34(Database issue):D187–91. 10.1093/nar/gkj161PubMed CentralView ArticlePubMed
- Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25(17):3389–3402. 10.1093/nar/25.17.3389PubMed CentralView ArticlePubMed
- Eddy SR: Profile hidden Markov models. Bioinformatics 1998, 14(9):755–763. 10.1093/bioinformatics/14.9.755View ArticlePubMed
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.