Incorporating background frequency improves entropy-based residue conservation measures
© Wang and Samudrala. 2006
Received: 14 June 2006
Accepted: 17 August 2006
Published: 17 August 2006
Several entropy-based methods have been developed for scoring sequence conservation in protein multiple sequence alignments. High scoring amino acid positions may correlate with structurally or functionally important residues. However, amino acid background frequencies are usually not taken into account in these entropy-based scoring schemes.
We demonstrate that using a relative entropy measure that incorporates amino acid background frequency results in improved performance in identifying functional sites from protein multiple sequence alignments.
Our results suggest that the application of appropriate background frequency information may lead to more biologically relevant results in many areas of bioinformatics.
Protein multiple sequence alignments are widely used to infer conservation of amino acid residues within an evolutionarily related family [1, 2]. Highly conserved residues tend to correlate with structural and/or functional importance, and accurate identification of such important residues aids in experimental characterization of protein function.
One commonly used sequence conservation measure is the entropy score. In its simplest form, the entropy score for each aligned column in a multiple sequence alignment can be expressed as:

$$S_{entropy} = -\sum_{i=1}^{n_{aa}} p_i \log p_i \qquad (1)$$

where $n_{aa}$ is the number of residue types in the column representing an alignment position, and $p_i$ represents the observed frequency of residue type $i$ in the aligned column. Other, more complicated, residue conservation measures derived from the entropy score have been developed and used in identifying functionally important residues [3–8].
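The simplest form of the entropy score can be sketched in a few lines of Python. This is a minimal illustration that assumes the standard Shannon form (the negative sum of p log p over the residue types observed in the column) and natural logarithms; the function name `column_entropy` is hypothetical and not taken from the authors' software.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy of one aligned column (natural log).

    `column` is a string or list holding one residue per sequence.
    Only residue types observed in the column contribute.
    """
    counts = Counter(column)
    total = len(column)
    # Sum -p*log(p) term by term so an invariant column gives exactly 0.0.
    return sum(-(c / total) * math.log(c / total) for c in counts.values())


invariant = column_entropy("CCCC")   # one residue type only
variable = column_entropy("ACDE")    # four equally frequent residue types
```

Under this form an invariant column scores zero and the score grows as the column becomes more variable, so lower values indicate stronger conservation.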
In recent years many sophisticated functional site prediction algorithms have been developed (for reviews, see [9, 10]). Many of these prediction algorithms implicitly or explicitly analyze the amino acid variations at a given position in multiple alignments. The evolutionary trace method analyzes residue variation patterns within and between protein subfamilies from multiple alignments, and maps important residues to protein structure [11, 12]. A further development of this method incorporates entropy information for more accurate ranking of residue importance. Oliveira et al. devised entropy-variability plots that can be used to identify structurally and functionally important residues. Pei et al. used the conservation difference between artificial sequence profiles and naturally occurring sequence profiles to detect homology and identify active sites. Soyer et al. used site-specific evolutionary models for predicting functional sites in proteins. Wang et al. used linear models to analyze multiple alignments for mutated viral proteins to identify amino acid positions important for drug resistance. Chelliah et al. identified interaction sites by separating the structural and functional constraints for each position in multiple alignments. Greaves et al. used geometry-based and sequence profile-based calculations for predicting enzyme active sites. Cheng et al. introduced a hybrid method incorporating both sequence conservation and structural stability to predict functional sites. The SIFT server automatically constructs multiple alignments for a query sequence and then predicts amino acid substitutions that are likely to affect protein function. The ConSurf server identifies protein functional regions by surface mapping of phylogenetic information inferred from multiple alignments. The MINER server uses phylogenetic motifs from multiple alignments to identify protein functional site regions.
Besides functional site identification, residue conservation information can also be used to identify residues that determine subfamily functional specificity [24–28]. The prevalence of these methods suggests that the extraction of conservation information from multiple alignments is important for the correct prediction of functional residues or regions.
Here, we argue that entropy scores that do not incorporate background amino acid frequencies are not theoretically optimal for calculating residue conservation. To demonstrate this, we rewrite the entropy score using a uniform amino acid frequency distribution $P_u$ ($p_u = 1/n_{aa}$ for each residue type in the aligned column):

$$S_{entropy} = -\sum_{i=1}^{n_{aa}} p_i \log p_i = -\sum_{i=1}^{n_{aa}} p_i \log \frac{p_i}{p_u} - \sum_{i=1}^{n_{aa}} p_i \log p_u = -D(P_i \| P_u) - \log p_u \qquad (2)$$

where $D(P_i \| P_u)$ is commonly referred to as the relative entropy or Kullback-Leibler divergence. Therefore, the entropy score is numerically identical to the negative relative entropy between the observed amino acid frequency distribution $P_i$ and a uniform distribution $P_u$, minus a constant ($\log p_u = -\log n_{aa}$). In statistics, relative entropy arises as the expected logarithm of the likelihood ratio, and it may be used as a measure of the distance between two probability distributions. In the case of residue conservation, a larger deviation from the "background" indicates stronger evolutionary constraint, which suggests that the position may perform an important functional role. However, nature does not sample every amino acid equally when creating proteins. Therefore, the simple uniform distribution $P_u$ above is not optimal as a reference distribution to evaluate functional importance.
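The relationship between the entropy score and the relative entropy against a uniform distribution can be checked numerically. The sketch below is only an illustration: the helper names `entropy` and `kl_divergence` and the example frequencies are made up for this check, and natural logarithms are assumed.

```python
import math


def entropy(p):
    """Shannon entropy of a discrete distribution (natural log)."""
    return sum(-x * math.log(x) for x in p if x > 0)


def kl_divergence(p, q):
    """Relative entropy D(P || Q); terms with p = 0 contribute nothing."""
    return sum(x * math.log(x / y) for x, y in zip(p, q) if x > 0)


# Hypothetical observed frequencies for a column with three residue types.
observed = [0.5, 0.25, 0.25]
n_aa = len(observed)
p_u = 1.0 / n_aa
uniform = [p_u] * n_aa

lhs = entropy(observed)
# Negative relative entropy to the uniform distribution, minus log(p_u).
rhs = -kl_divergence(observed, uniform) - math.log(p_u)
```

Here `lhs` and `rhs` agree to floating-point precision, which is the sense in which the entropy score implicitly uses a uniform background.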
We propose that a relative entropy measure incorporating the observed background frequency from protein sequence databases would be a better measure to capture the functional importance of amino acid residues. More specifically, we propose to use the formula:

$$S_{relative\_entropy} = D(P_i \| P_{ib}) = \sum_{i=1}^{n_{aa}} p_i \log \frac{p_i}{p_{ib}} \qquad (3)$$

where $P_{ib}$ can represent the background amino acid frequencies found in naturally occurring protein sequences, or any other arbitrary set of background frequencies. This measure will increase the scores for aligned columns containing "rare" residues, which are often functionally important. To explain this in a more intuitive way, consider two invariant positions that have only cysteines and only serines, respectively. The position with cysteines is more likely to be functional. The entropy measure will assign the same score to the two positions, but the relative entropy measure assigns a higher score to the invariant cysteine position, since cysteine has a much lower background frequency (~2%) than serine (~7%).
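The cysteine versus serine example can be worked through with a short sketch. The background values below are the rounded figures quoted in the text (about 2% for cysteine and 7% for serine), used purely for illustration; they are not the frequencies used in the benchmark, and the function name `relative_entropy_score` is hypothetical.

```python
import math
from collections import Counter

# Rounded, illustrative background frequencies taken from the prose above.
BACKGROUND = {"C": 0.02, "S": 0.07}


def relative_entropy_score(column, background):
    """Sum of p * log(p / p_background) over residue types seen in the column."""
    counts = Counter(column)
    total = len(column)
    return sum(
        (c / total) * math.log((c / total) / background[aa])
        for aa, c in counts.items()
    )


cys_score = relative_entropy_score("CCCCCC", BACKGROUND)  # log(1 / 0.02)
ser_score = relative_entropy_score("SSSSSS", BACKGROUND)  # log(1 / 0.07)
```

Both columns are invariant, so a plain entropy score cannot separate them, whereas with these assumed backgrounds the cysteine column scores about 3.9 and the serine column about 2.7 (natural log).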
To investigate whether our proposed relative entropy measure ($S_{relative\_entropy}$) is more sensitive than the entropy measure ($S_{entropy}$) in detecting functional sites, we evaluated the performance of these measures in identifying functionally important residues using the Thornton and Lovell datasets. In addition, to investigate how the use of different background frequencies affects the performance of the relative entropy method, we used two sets of frequencies: the general background frequencies observed in nature and the family-specific background frequencies retrieved from the alignments for each query sequence.
We further investigated whether entropy-based methods could benefit from incorporating more accurate amino acid frequency information. We compiled an HMM from the multiple alignments for each query sequence, and calculated the positional entropy or relative entropy (using the general background frequencies) for each aligned column, similar to a previous study. There are two main advantages of using HMMs: (1) sequences are weighted so that the effects of uneven or biased database sampling are reduced and (2) frequencies of unobserved amino acids can be estimated through the use of Dirichlet mixtures.
Functional site prediction could be improved further by increasing the sophistication of the measures considered. The relative entropy method scores individual positions without considering neighboring residues, so a method that analyzes context (neighboring residues) may improve performance. The multiple alignments generated by the PSI-BLAST program may not be optimal, and more accurate multiple alignments (such as those generated by the HMM method) may improve performance. In addition, an appropriate treatment of gaps and sequence weights (for example, a family-specific treatment), consideration of phylogenetic relationships among sequences, and analysis of local structural information when available are also likely to improve performance. We believe that a hybrid method that incorporates the relative entropy method and all the above improvements will have significantly better performance for functional site prediction.
In conclusion, the use of background frequency information significantly improves entropy-based functional site prediction. This principle has been advocated before (such as in ), but its use has been very limited. For example, sequence logos are widely used to visually display conserved nucleotide or amino acid sites in sequence motifs; however, many logo generation programs [35–38] are unable to accept user-supplied background frequencies. (Some exceptions include the PICTOGRAM server and the CONSENSUS server.) In addition, several programs for editing or annotating multiple alignments exist [41, 42], but they are unable to use background frequencies to calculate relative entropy for each aligned residue.
The use of background information is limited in other areas of bioinformatics as well. For example, many programs are available to identify "functional enrichment" for a list of genes from a microarray experiment, but only a few of them are able to accept a user-supplied list of "background genes". Dozens of tools are available to identify transcription factor binding sites (TFBS) or other functional motifs in a given sequence, but few of them are able to take into account the background frequency of predicted TFBS or motifs in the corresponding genome (exceptions include ). Fold recognition methods are widely used to assign a query sequence to a structural fold, but few consider the relative abundance (or prior probability) of candidate folds in the corresponding proteome. The broader application of appropriate background information in all areas of bioinformatics will lead to more biologically relevant results.
We used two datasets consisting of protein functional sites for evaluating different algorithms. The Thornton dataset was compiled manually from primary literature on known protein structures and was shown to be more comprehensive and specific than the SITE annotation in PDB files. The Lovell dataset contains manually compiled protein functional sites, including ligand binding sites and enzyme active sites. The Thornton dataset contains 1,546 enzyme active sites from 508 proteins, and the Lovell dataset contains 1,137 functional sites from 243 proteins. The same versions of the two sets have been used previously for structure-based searching of functional sites.
We used two criteria for evaluating the performance of functional site prediction algorithms. The first criterion is the ROC score, which is the area under the curve that plots the fraction of true positives against the fraction of false positives as the classification threshold is varied. A perfect classification algorithm that puts all the functional sites at the top of the ranked residue list has an ROC score of 1, and a random classification algorithm has an ROC score of 0.5. The second criterion is the "top 10 hits" score, which is the number of functional sites among the top 10 scoring residues in a given protein. A perfect classification algorithm has a top 10 hits score of $N_{fun}$, or 10 if $N_{fun}$ is more than 10, while a random classification algorithm has an average top 10 hits score of $10 N_{fun}/N$ ($N_{fun}$ and $N$ denote the number of functional residues and the number of all residues in the given protein, respectively).
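The two evaluation criteria can be sketched as follows. This is an illustrative implementation, not the authors' evaluation code: the ROC score is computed through the equivalent rank-based formulation (the fraction of functional/non-functional residue pairs that are ranked correctly, with ties counted as one half), and the function names are hypothetical.

```python
def roc_score(scores, is_functional):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for s, f in zip(scores, is_functional) if f]
    neg = [s for s, f in zip(scores, is_functional) if not f]
    if not pos or not neg:
        raise ValueError("need at least one functional and one other residue")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def top_hits(scores, is_functional, k=10):
    """Number of functional residues among the k highest-scoring residues."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sum(1 for i in ranked[:k] if is_functional[i])


def random_top_hits(n_fun, n_total, k=10):
    """Expected top-k hits for a random ranking: k * N_fun / N."""
    return k * n_fun / n_total


# Toy protein with six residues; the two functional residues score highest.
toy_scores = [3.9, 2.7, 0.4, 1.1, 0.2, 0.9]
toy_labels = [True, True, False, False, False, False]
toy_roc = roc_score(toy_scores, toy_labels)
toy_top = top_hits(toy_scores, toy_labels, k=10)
```

For the toy ranking above the ROC score is 1 and the top 10 hits score equals the number of functional residues, matching the description of a perfect classifier.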
We evaluated several functional site identification methods that take protein multiple sequence alignments as input for their predictions. For the Thornton and Lovell datasets, we generated multiple sequence alignments by searching each query sequence against the Uniref90 database with the PSI-BLAST program blastpgp in the BLAST program package. We used three iterations (through the "-j 3" option), the "-m 6" display option and all other default parameters for the PSI-BLAST program. The resulting multiple sequence alignments were converted to ClustalW format, and then analyzed by various scoring methods described below.
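As a rough sketch of the search step, the command line could be assembled as below. Only the "-j 3" and "-m 6" options come from the text; the `-i`, `-d` and `-o` options are the usual legacy blastpgp flags for the query file, database and output file, and the file and database names are placeholders. The helper only builds the argument list and does not run the program.

```python
def build_blastpgp_command(query_fasta, database, output_path):
    """Return a legacy blastpgp argument list (hypothetical helper)."""
    return [
        "blastpgp",
        "-i", query_fasta,   # query sequence in FASTA format (placeholder name)
        "-d", database,      # formatted database, e.g. a local Uniref90 copy
        "-j", "3",           # three PSI-BLAST iterations, as stated in the text
        "-m", "6",           # alignment display option stated in the text
        "-o", output_path,   # where the alignment report is written
    ]


command = build_blastpgp_command("query.fasta", "uniref90", "query.m6.txt")
# The list can be passed to subprocess.run(command, check=True) on a machine
# where the legacy BLAST package is installed.
```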
For the entropy and relative entropy methods, we used equations (1) and (3) in the main text to assign a score to each aligned column in the multiple alignments. We treated gaps in the same way as an amino acid, though we found little performance difference when ignoring gaps. Only the residue types that appear in aligned columns were used in the computation of the relative entropy score. The background distribution $P_{ib}$ in equation (3) can be varied to explore the use of different background frequencies. In our benchmarking experiment, we used two different sets of background frequencies: (1) the general background frequencies defined in the karlin.c program of the BLAST package, and (2) the family-specific background frequencies observed in the multiple alignments for each query sequence.
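The column scoring with gaps and a family-specific background can be sketched as follows. Several details here are assumptions made for illustration rather than the published procedure: the family-specific background is estimated by pooling every symbol in the alignment, the gap character `-` is counted as one more symbol, sequences are unweighted, and the function names are hypothetical.

```python
import math
from collections import Counter


def family_background(alignment):
    """Pooled symbol frequencies over the whole alignment (gaps included)."""
    counts = Counter(symbol for seq in alignment for symbol in seq)
    total = sum(counts.values())
    return {symbol: c / total for symbol, c in counts.items()}


def score_columns(alignment, background):
    """Relative entropy of each aligned column against `background`.

    Only symbols that appear in a column contribute to its score.
    """
    scores = []
    for column in zip(*alignment):
        counts = Counter(column)
        n = len(column)
        scores.append(
            sum(
                (c / n) * math.log((c / n) / background[symbol])
                for symbol, c in counts.items()
            )
        )
    return scores


# Toy alignment of three equal-length sequences; '-' marks a gap.
toy_alignment = ["CAS-", "CGSA", "CASA"]
toy_background = family_background(toy_alignment)
toy_scores = score_columns(toy_alignment, toy_background)
```

In this toy alignment the invariant cysteine and serine columns receive the highest scores, and the two variable columns score lower. To use the general background instead, a dictionary of database-wide frequencies (plus an entry for the gap symbol, if gaps are kept) would be passed in place of `toy_background`.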
For the HMM-derived entropy and relative entropy methods, we first compiled an HMM using the hmmbuild program in the HMMER package with default parameters, and then calculated the positional entropy or relative entropy using amino acid frequencies estimated by the HMM. The general background frequencies were used for the relative entropy computation.
We also used three conservation measures implemented in the AL2CO program, including the entropy measure (AL2CO_entropy), the variance measure (AL2CO_variance) and the sum-of-pairs measure (AL2CO_sop). All three methods use the default "independent count" scheme for weighting sequences. The default parameters for the AL2CO program were used for computation, except that the BLOSUM62 matrix rather than the identity matrix was used in the sum-of-pairs method. For the SCORECONS method, we used the scorecons program with default parameters.
This work was supported by a Searle Scholar Award, an NSF CAREER award, NSF grant DBI-0217241, and NIH grant GM068152-01. We wish to thank members of the Samudrala group and Dr. Sridhar Hannenhalli for helpful discussions and comments.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.