BMC Bioinformatics BioMed Central Methodology article Identification of physicochemical selective pressure on protein

Background Statistical methods for identifying positively selected sites in protein coding regions are one of the most commonly used tools in evolutionary bioinformatics. However, they have been limited by not taking the physiochemical properties of amino acids into account. Results We develop a new codon-based likelihood model for detecting site-specific selection pressures acting on specific physicochemical properties. Nonsynonymous substitutions are divided into substitutions that differ with respect to the physicochemical properties of interest, and those that do not. The substitution rates of these two types of changes, relative to the synonymous substitution rate, are then described by two parameters, γ and ω respectively. The new model allows us to perform likelihood ratio tests for positive selection acting on specific physicochemical properties of interest. The new method is first used to analyze simulated data and is shown to have good power and accuracy in detecting physicochemical selective pressure. We then re-analyze data from the class-I alleles of the human Major Histocompatibility Complex (MHC) and from the abalone sperm lysine. Conclusion Our new method allows a more flexible framework to identify selection pressure on particular physicochemical properties.


Background
Traditionally, the nonsynonymous to synonymous rate ratio (the d N /d S ratio, also known as ω) is a measure of the strength of selection acting on a protein coding nucleotide sequence. When the nonsynonymous substitution rate is higher than the synonymous substitution rate at a particular codon site, this site is assumed to be undergoing positive selection, i.e. selection in favor of new nonsynonymous mutations. Conversely, if the nonsynonymous substitution rate is lower than the synonymous substitution rate at a codon site, it is interpreted as evidence for negative selection, i.e. selection against mutations.
It is well-known that the rate of substitution between amino acids at a particular site depends on both its location in the protein (e.g. [1]) and the physicochemical properties of the amino acids involved (e.g. [2][3][4][5][6]). Site specific models have been successful in identifying particular residues, or codon sites, that have been targeted by positive selection (e.g. [7][8][9]). However, relatively little has been known regarding the particular physicochemical properties that have been subject to positive selection. Site-specific information regarding the physicochemical properties that have been subject to positive selection is of great interest in many systems. For example, in viral genes, site-specific information regarding physicochemical properties targeted by selection may shed evolutionary light on the biochemistry underlying mutational evasion of an immune response (e.g. [10]). Therefore, it would be very helpful to have statistical methods that can determine the physicochemical properties subject to selection at specific sites. Sainudiin et al. (2005) [11] conducted a study to detect selection acting on the physicochemical properties at individual codon sites using the maximum likelihood framework by [12] and [13,11] generalized the concept of the nonsynonymous to synonymous rate ratio by replacing it with the property-altering to property-preserving rate ratio (γ) that is invoked by a partition of the amino acids according to the physicochemical properties of interest, to explore the physicochemical properties that were targeted by positive selection. A major limitation of their model is that it divided substitutions into two groups: (1) synonymous mutations and property conserving nonsynonymous mutations, and (2) property altering nonsynonymous mutations.
In this paper, we generalize the idea proposed by [11] by allowing three categories of mutations: synonymous mutations, property conserving nonsynonymous mutations, and property altering nonsynonymous mutations. The rate of the latter two types of mutations, each of which is scaled relative to the rate of synonymous mutations, is denoted by ω and γ, respectively. This allows us to investigate the selective pressure acting on any physicochemical property of interest using γ, while simultaneously accounting for the non-specific selective pressure at the amino acids level using ω. We also develop a set of new mixture models that allows site-specific inferences of selection acting on specific physicochemical properties. Finally, we illustrate the method on both simulated and real data sets.

Implementation
Modeling physicochemical pressure In order to test whether there is selective constraint or preference acting on a particular physicochemical property at the amino acid level, we introduce a new parameter γ within the framework of the continuous-time Markov chain models of codon evolution proposed by [14] and [15]. The state space of the model, assuming the universal genetic code, is given by the 61 sense codons. The 61 × 61 rate matrix Q = {q ij } gives the rate of transition from codon i to codon j. The transition rate from codon i to codon j is assumed to be proportional to the stationary distribution of codon j. Here we propose two modifications to the basic model to allow selection to act on specific physicochemical properties.
In Model A, if codon i and j differ by exactly one nucleotide, then Similarly in Model B, if codon i and j differ by exactly one nucleotide, then In both models q ij equals 0 if codon i and j differ in more than one position, and . The parameter κ is the transition/transversion rate ratio and codon bias is accounted for by incorporating the stationary distribution of codon j. Notice that ω no longer can be interpreted directly as the nonsynonymous/synonymous (d N /d S ) rate ratio, because this ratio also depends on γ.   Although it is a gross simplification of the underlying biology to assume that amino acid mutations can be divided into two categories (property altering and property preserving), models based on this simplification provide considerable computational efficiency while allowing the effect of various physicochemical properties to be explored. The original models by [14], and the models by [16] allow a more complex relationship between rate of substitution and physicochemical properties of the amino acids; they do not, however, simultaneously allow for site-specific variation of these properties in the way the models we propose do. est here is to look for positive selection in the more conventional sense (nonsynonymous to synonymous rate ratio >1), we will analyze simulated and real data sets with model A.
We will construct models that allow site-specific variation by assuming that ω and γ can each have two rate classes: ω 0 = 1, ω 1 >1 and γ 0 = 1, γ 1 >1 and therefore, there are 4 possible site classes to which each codon site can belong.
Mixture models can then be constructed by allowing each codon site to be in each of these classes with a certain probability, that can be estimated from the data. Likelihood ratio tests (LRTs) can be constructed by comparing models with only a few site classes to models with more site classes. For example, we can test whether selection favors substitution that alter the specified physicochemical properties with the null mixture model A with only two site classes, namely (i) and (iii) (i.e. only allowing the site classes with γ ≤ 1) against the alternative mixture model A that allows for all four site classes. We refer to these models as model A2 and A1, respectively (Table 1). Various likelihood ratio tests can also be constructed. For instance, we can obtain evidence for the presence of a significant number of sites in site class (ii) (i.e. the null model has site classes (i), (iii) and (iv) against the full model).
The 15 taxa tree used in the simulation study Figure 1 The 15 taxa tree used in the simulation study. We make the standard assumptions of the codon-based likelihood framework (e.g. [13]); (1) we assume the true topology of the tree is known and (2) each codon site in the sequence is independent of the others. Then we calculate the log likelihood of the data given the topology of the tree and the model by summing up the log-likelihood of each site. The likelihood is optimized by the BFGS algorithm in Numerical Recipes in C [17]. After the maximum likelihood estimates (MLEs) are obtained, we use the Empirical Bayes approach [12] to assign sites to site classes. The source code of the C program which implements both of the models (EvoRadical) has been deposited to sourceforge.net [18] and is licensed under the GPL [19].

Data analysis
In order to investigate the performance of the model we analyze both simulated and real data sets. The simulated data sets were generated using a modified version of Evolver in the PAML 3.13d package [20]. The parameters used for simulation were ω = 0.4 and γ = 4 for data set 1, ω = 4 and γ = 4 for data set 2, ω = γ = 0.4 for data set 3 and ω = γ = 4 for data set 4. The transition/transversion ratio κ was set to 4 for all data sets. We used the volume partition for all four simulated data sets, by dividing amino acids  Table 2.
We re-analyzed the MHC class I data set from [21] ( [22,23]) with Model A. [21] found numerous positively selected sites and showed that most of them are Antigen Recognition Sites (ARS). The same data set was also used in [11], to examine if the new method identifies the same sites as the original method. We also analyzed the abalone sperm lysin data from [23], with regard to the hydrophobicity, volume, polarity and charge physicochemical properties. We specified each of the four physicochemical properties: volume, hydrophobicity, charge, and polarity partitions with both Model A1 and A2 in these analyses.

Simulated data
The result of the simulation analysis is summarized in Table 2 and Figure 2. When the correct partition was used, both the LRT and the posterior Bayesian categorization were very accurate. When simulated data set 1 (with ω≤1 and γ>1) was analyzed using the volume partition (the same partition that the data was simulated from) and there was positive selection acting only on codon substitutions that altered the volume of the encoded amino acids, 99% of the sites were correctly identified as being in the ω≤1, γ>1 category. Moreover, the LRTs performed between Model A1 versus Model A2 on the 100 replicates were all significant. When the data was simulated with ω>1 and γ<1 (data set 2), 97% of the sites were classified χ 3 2 in the correct site class, and none of the LRTs between Models A1 and A2 were significant, as expected.
Amino acids that differ in volume may also differ in other physicochemical properties. Positive selection on another physicochemical property may then be inferred, even though selection may have truly targeted volume. To illustrate this concept and better understand its effect, we repeated the data analysis with other physicochemical properties. We partitioned the amino acids into those with high (A, C, I, V, L, M, F, Y, W, H, T, K, R) and low (R, G, S, D, E, N, Q) hydrophobicity. In each replicate, data was simulated assuming selection only targeted volume, but was analyzed under the hydrophobicity partition. In this case the resulting LRTs were all significant for both data set 1 and data set 2 (ω>1 and γ>1). The results were significant for data set 1 when the hydrophobicity physicochemical property was used due to some volume altering codon changes that are hydrophobicity altering as well (52 out of the 99 distinct amino acid pairs that differ in volume). Likewise, since there are some volume conserving nonsynonymous codon changes that are hydrophobicity altering (44 out of the 91 distinct amino acid pairs that differ in hydrophobicity), the LRTs were significant for data set 2 as well. Thus, the extent of overlap between two sets of distinct amino acid pairs, where each set contains pairs that differ with respect to one of the two given physicochemical properties, ultimately determines our ability to distinguish between the two properties, in terms of being exclusively targeted by positive selection. None-Distribution of the posterior probabilities for the correct site class classification in each of the simulated data sets Figure 2 Distribution of the posterior probabilities for the correct site class classification in each of the simulated data sets. Each data set has 100 replicated and they were analyzed using Model A1 and the volume partition.
theless, our method can reliably identify the true physicochemical property in addition to other properties that have considerable overlap with the true property.
Data sets 3 and 4 were simulated with ω = γ, in which case our model reduces to the original Goldman and Yang 94 model [14]. Our method classified = 99% of the sites correctly with respect to both volume and hydrophobicity partitions for both data set 3 (ω≤1 and γ≤1) and data set 4 (ω>1 and γ>1). The MHC class 1 dataset Table 3 shows the log-likelihoods and the parameter estimates of the analysis of the MHC class 1 dataset. The LRTs of Model A2 against Model A1 were significant in all the partitions examined; indicating that a significant amount of positive selection is acting on the codon substitutions that change the physicochemical property under study. We also found that, for each partition, there are a fair proportion of sites that are in the (ω<1, γ>1) category, suggesting positive selection acting on the particular physicochemical property.
However, when we looked at the sites that were identified in the previous study [11] with high posterior probabilities of being positively selected with respect to a particular physicochemical property; we found that some of these sites were positively selected regardless of the physicochemical property of the amino acids they code for (Table  4). For instance, almost all the posterior probability mass of the 3 sites that were identified for positive selection with respect to the volume partition (63, 67, 93) were more likely to be in the categories with ω>1; and the posterior probabilities of being in the (ω≤1, γ>1) category were almost 0. The same was observed for the polarity par-  and 156 (sites that were previously identified to be positively selected with respect to charge), although they were still more likely to be in the (ω>1, γ>1) category. The main reason for this discrepancy between our study and that of [11] is due to the manner in which physicochemical selection is defined and measured. [11] defined the parameter gamma to be the rate of nonsynonymous substitutions that alter the property of the encoded amino acids and scale it relative to the rate of all other codon substitutions pooled together, i.e. both synonymous substitutions and property-conserving nonsynonymous substitutions. However, we extended their model to allow for a separation of the pooled scaling rate into rates of nonsynonymous property-conserving substitutions and synonymous substitutions. This naturally implies that our parameter gamma is not equivalent to their gamma even if the underlying physicochemical property is the same. However, our parameterization is biologically more realistic since it allows a more sensible scaling with respect to the synonymous substitution rate.  Table 4: Posterior probabilities of being in each site class for the previously identified positively selected sites in the MHC data in [11].

Previously identified sites
Posterior probabilities in the 4 categories with the same partition The abalone sperm lysin dataset The results of the analysis of the abalone sperm lysin data, when amino acids are categorized according to hydrophobicity, volume, polarity and charge are shown in Table 5.

1) The charge partition
Our method identified sites belonging to all 4 classes with high posterior probabilities. We compared with the results obtained by [23] (M3 in Table 4) using the PAML software [20]. We found that all sites in the (ω≥1, γ≥1) site class were also classified as positively selected sites (posterior probabilities > 0.95) by [23] using model M3 of the standard framework. On the other hand, none of the sites in the ω<1 site classes (ω<1, γ<1 and ω<1, γ≥1) were classified as positively selected in their study. Thus, our results identified additional categories of putatively positively selected sites in which selection seems to have favored substitutions that alter specific physiochemical properties of the amino acids.  only targeting a particular physiochemical property while the site remains conserved with respect to other properties.
2) The volume partition There are a number of sites that are targeted for positive selection with respect to the volume property (ω≤1, γ>1), however none of the positively selected sites are conserved under this partition. This may indicates that volume is an important property targeted by positive selection. It is worth mentioning that most of the sites in the (ω≤1, γ>1) site class are adjacent in the 3D structure to a positively selected site previously identified by [23]. For instance, sites 40, 73, 114, 131 are adjacent to sites 41, 74, 113, 132 (Figure 4), and site 127 is adjacent to 126. It is possible that, as [11] pointed out earlier on the basis of the vast site-directed mutagenetic literature, substitutions involving volume change of the residues may slightly change the structure of the peptide which in turns improves the ability to incorporate new mutations.

Discussion
A major limitation of the current approach is that the correct partition cannot be directly inferred from the data. The method cannot statistically identify which amino acid property is being targeted by positive selection without input from the user, however, the program can be used to explore pre-defined hypotheses regarding positive (or negative) selection. Our simulation study showed that when the correct partition has been specified, the method can accurately detect positive selection acting on the associated amino acid property.
We note that the similarities between amino acids are probably better described by continuous functions. Much work has been done on continuous models (e.g. [14,16]) in the context of codon-based likelihood models. However, in the framework of mixture models that allow variation across sites, these approaches are not computationally tractable and require a discrete approximation over many categories in multiple dimensions.
Our new method provides more information than previous methods and yet maintains computational efficiency.
We also want to note that the regularity conditions for the χ 2 approximation are not satisfied with the LRT between A2 and A1. This is because when either of the parameters p 3 or p 4 hit the boundary of the parameter space (i.e. when p 3 = 0 or p 4 = 0), either ω 1 or both ω 1 and γ 1 is non-identifiable.
It is worth mentioning that a significant LRT does not necessarily imply that there is positive selection acting solely on the physicochemical property defining the partition examined. Since the amino acids that differ with respect to physicochemical property may also differ with respect to other properties, it is not possible with the current method to exclude that selection has been targeting other properties over the one specified by the user.
Positive selection inferred by the new method informs the user that there are more substitutions between amino acid in the chosen partition than expected under neutral evolution. A clear advantage of our method over methods that do not take physicochemical properties into account is that it can detect positive selection when selection is only acting on some particular subset of nonsynonymous substitutions, while conserving others. Thus, positive and negative selection acting on the same protein residue can be inferred simultaneously for any specified physicochemical property in a site-specific manner.
This method can also be used to examine the pattern of negative selection in proteins that may not be under positive selection. In many cases, it would be interesting to explore which residues are conserved predominantly with respect to a particular physicochemical property. Site specific models of physicochemical selection provide a more statistically rigorous framework for finding sites that are under selection than method that are based on calculating amino acid content in particular sites without taking the underlying phylogenetic tree into account.

Conclusion
We see this new method as a step forward in incorporating information regarding physicochemical properties into studies of positive selection, as well as a step towards methods that allow identification of selection acting on particular amino acid properties, without any prior specification by the user of which properties to examine. Models that allow site-specific selection on different amino acid properties will be an important tool in studies of molecular evolution and should help bridge the gap between structural biology and molecular evolution.