Environment specific substitution tables for thermophilic proteins

Background: Thermophilic organisms are able to live at high temperatures ranging from 50 to > 100°C. Their proteins must be sufficiently stable to function under these extreme conditions; however, the basis for thermostability remains elusive. Subtle differences between thermophilic and mesophilic molecules can be found when sequences or structures from homologous proteins are compared, but often these differences are family-specific and few general rules have been derived. The availability of complete genome sequences has now made it feasible to perform a large-scale comparison between mesophilic and thermophilic proteins, the latter of which primarily come from archaeal genomes although a few complete genomes of thermophilic eubacteria are also available.

Results: We compared mesophilic proteins with their thermophilic counterparts of archaeal or eubacterial origins independently. This was based on the assumption that in these two kingdoms, different mechanisms may have been exploited for the adaptation of proteins at high temperatures. We derived the environment specific amino acid compositions of thermophilic proteins from 10 archaeal and seven eubacterial genomes, by aligning a large number of sequences from thermophilic proteins with their close mesophilic homologues of known three-dimensional (3D) structure. We further analysed environment specific substitutions, which lead from mesophilic proteins to either archaeal or eubacterial thermophilic proteins.

Conclusion: Our comparisons were based on homology-based structural predictions for a large number of thermophilic proteins. We demonstrated that thermal adaptation in the archaeal and eubacterial kingdoms is achieved in different ways. The main differences concern the usage of Gln, Ile and positively charged amino acids. In particular, archaeal organisms appeared to have acquired thermostability by substituting non-charged polar amino acids (such as Gln) with Glu and Lys, and non-polar amino acids with Ile on the surface of proteins.


Background
Thermophilic organisms are able to live at high temperatures ranging from 50 to > 100°C. They belong either to the archaeal or the eubacterial kingdom and they have been subdivided, setting a somewhat arbitrary temperature boundary, into thermophiles and hyperthermophiles. Initially, most archaeal species were isolated from extreme habitats but it has recently become clear that archaea, as well as eubacteria, are widespread and abundant in several diverse niches [1].
Thermophilic organisms are interesting for several reasons and in particular because they are a source of very stable proteins. Understanding the higher-temperature resistance of thermophilic proteins is essential for the studies of protein folding and stability, and is critical for designing efficient enzymes that can work at high temperatures. Although many studies have been carried out for several decades, it has so far been difficult to identify any single factor as being primarily responsible for enhancing thermal stability. This is probably because protein stability is determined by a fine balance between several contributing factors. Moreover, even considering multiple factors, few general rules have been derived and often rules derived for one protein family did not apply to other families.
The analysis of high quality structures would be the most informative approach if a large dataset were available, but unfortunately this is not the case. On the other hand, the analysis of whole genomes can benefit from the increasing availability of large numbers of protein sequences. It is, therefore, desirable to combine the advantages of both approaches. This can be achieved by aligning sequences from thermophilic proteins with their close mesophilic homologues of known 3D structure. Since structure is better conserved than sequence, the alignment of thermophilic sequences to a homologous structure implies a likely 3D mapping of the protein sequences in question, yielding homology-based structural predictions for many proteins [18].
Aided by the recent progress in genome sequencing, we compared, in this paper, mesophilic proteins with archaeal and eubacterial thermophilic proteins separately. In doing so, we were motivated by the consideration that different strategies for thermal adaptation might have been exploited by organisms evolutionarily distant and that merging results obtained from thermophilic archaea and eubacteria might have hindered the previous attempts to identify the determinants of protein stability. We derived new general rules for thermal adaptation, specific to archaea and eubacteria.

Databases
Two protein databases were created for organisms living above 50°C. One included 19,168 protein sequences derived from the genomes of 10 archaea, the other 17,040 protein sequences from the genomes of seven eubacteria. In Table 1 we report the names of the organisms, their optimal growth temperatures (OGTs) and the GC content of their genomes. Hyperthermophiles, i.e., organisms that live above 80°C, are more frequently found in the archaeal kingdom. However, GC content does not correlate with OGTs and, on average, is only slightly higher in eubacteria than in archaea.
A set of 3763 protein structures belonging to 1057 different families was taken from HOMSTRAD, a database of protein structural alignments for homologous families [19]. The sequence corresponding to each structure was used as a query to search separately against the two databases of thermophilic proteins. We used BLAST [20] under stringent conditions and detected close archaeal homologues of 1005 HOMSTRAD proteins and close eubacterial homologues of 1580 HOMSTRAD proteins. Accordingly, we built 1005 alignments for archaea and 1580 alignments for eubacteria, in each of which the first sequence, from a mesophilic protein, is aligned against its thermophilic homologues. The residues of the first protein, the structure of which is known, were assigned to one of eight different structural environments: alpha helix, exposed (HA) or buried (Ha); beta strand, exposed (EA) or buried (Ea); positive main-chain phi angle, exposed (PA) or buried (Pa); and coil, exposed (CA) or buried (Ca). We counted how many times an amino acid from the mesophilic sequence in a given environment is substituted by another amino acid in the thermophilic sequences (or is conserved). The total number of substitution counts was 3,011,344 for archaea and 4,432,631 for eubacteria.
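The counting step described above can be illustrated with a minimal sketch. This is not the modified SUBST program; the function name, the gap convention ('-') and the toy sequences are all assumptions made for illustration only.

```python
from collections import Counter

# The eight environment labels named in the text: main-chain state
# (H, E, P, C) combined with accessibility (A = exposed, a = buried).
ENVIRONMENTS = ("HA", "Ha", "EA", "Ea", "PA", "Pa", "CA", "Ca")


def count_substitutions(query, environments, homologues):
    """Count (environment, mesophilic residue, thermophilic residue) triples.

    query        -- aligned mesophilic sequence of known structure
    environments -- one environment label per alignment column of the query
    homologues   -- aligned thermophilic sequences, same length as query
    Columns with a gap ('-') in either sequence are skipped (an assumption);
    identical residues are counted as conserved positions.
    """
    if len(environments) != len(query):
        raise ValueError("need one environment label per query column")
    counts = Counter()
    for hom in homologues:
        if len(hom) != len(query):
            raise ValueError("aligned sequences must have equal length")
        for meso, env, thermo in zip(query, environments, hom):
            if meso == "-" or thermo == "-":
                continue
            counts[(env, meso, thermo)] += 1
    return counts


# Toy alignment (hypothetical data, not taken from the paper).
query = "QAKL"
envs = ["CA", "HA", "HA", "Ha"]
homs = ["EIKL", "KI-L"]
counts = count_substitutions(query, envs, homs)
```

In this toy case the exposed-helix Ala is replaced by Ile in both homologues, so `counts[("HA", "A", "I")]` is 2, while the gapped column contributes nothing.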
For a comparison, we used alignments of mesophilic proteins stored in HOMSTRAD and counted how many times an amino acid from the mesophilic sequence in a given environment is substituted by another amino acid in homologous mesophilic sequences (or is conserved).

Amino acid composition
We counted the occurrence of each amino acid to derive the compositions of thermophilic proteins (of archaeal and eubacterial origins) and compared them with that of mesophilic proteins (Fig. 1) using a modified version of SUBST (K. Mizuguchi, unpublished).
Charged amino acids, more precisely, Lys in eubacteria, Arg in archaea, and Glu in both kingdoms, are more abundant in thermophilic proteins than in their mesophilic counterparts. Interestingly, Asp and His are not more abundant in either thermophilic group. As already reported [14,16,17], in thermophilic proteins the higher percentage of charged amino acids is compensated by the lower percentage of polar, non-charged amino acids (Ser, Thr, Asn, and Gln). However, we observed subtle differences between eubacteria and archaea; compared to mesophiles, Asn and Ser are significantly under-represented only in thermophilic eubacteria. Gln is under-represented in both eubacteria and archaea but more strongly in archaea.
It was proposed that Asn and Gln are avoided in thermophilic proteins because of their chemical instability at high temperatures due to deamidation [16]. Yet the deamidation of proteins occurs primarily at Asn residues, except in very long-lived proteins, where Gln deamidation is also observed [21]. If the chemical instability at high temperatures were the sole cause of avoiding Asn and Gln, Asn should be under-represented more strongly than Gln in both eubacterial and archaeal thermophilic proteins. The observed under-representation of Asn (and Ser) only in eubacteria and the stronger under-representation of Gln in archaea require an alternative explanation.
These differences may be explained by the proposal that processes other than selection due to biochemical properties of the amino acids affect the patterns of amino acid substitution between mesophiles and thermophiles [22]. In addition to biochemical properties and the G/C content of their codons, amino acids differ in their cost of uptake, synthesis or incorporation into proteins. If these bioenergetic costs vary among domains, different patterns of amino acid substitution can be observed between different pairs of mesophiles and thermophiles.
The under-representation of Gln in archaea is consistent with its bioenergetics. Glutaminyl-tRNA synthetase is absent in archaea but is present in some eubacteria, while asparaginyl-tRNA synthetase is absent in some eubacteria and archaea. In the organisms without Gln- and Asn-tRNA synthetases, the inclusion of Asn and Gln into proteins involves the formation of mis-acylated Asp-tRNA(Asn) or Glu-tRNA(Gln), and their subsequent amidation catalysed by amidotransferases [23]. In thermophilic archaea, which lack Gln-tRNA synthetases, Gln appears to be under-represented because of its instability at high temperatures and the cost of incorporating it into proteins. However, the previously reported negative correlation between the content of Gln and OGTs [24] still poses a question. The list of complete microbial genomes at NCBI currently contains only one fully annotated psychrotolerant archaeon (Methanococcoides burtonii). A proper explanation, therefore, awaits future investigation.

Table 1 note. Data were obtained from the NCBI genome database [32] with the exception of G/C content and OGT of Geobacillus kaustophilus, which were taken from the DSMZ database of organisms (Braunschweig, Germany) [37].
Regarding aliphatic hydrophobic amino acids, we observe that Ala, Leu and Val are over-represented in thermophilic eubacteria, whereas in archaea only the beta branched amino acids, Ile and, to a lesser extent, Val, are over-represented. As already reported [12,18], thermo-labile Cys is under-represented in thermophiles of both archaeal and eubacterial origins. This suggests the possibility of a significant evolutionary pressure against Cys being conserved (assuming that thermophilic proteins evolved from mesophilic proteins) or being introduced (assuming that thermophilic proteins did not evolve from mesophilic proteins) unless it plays a structural (e.g., disulphide-bonded) or functional (e.g., metal-binding or catalytic) role. Trp is another potentially thermo-labile amino acid that is under-represented both in archaeal and eubacterial proteins.

Environment specific amino acid composition
We inferred the secondary structure and the accessibility of the thermophilic proteins by aligning them to the mesophilic proteins of known structure. In Table 2 we report the amino acid compositions in the different environments considered. Strictly speaking, we show the amino acid composition of the regions of thermophilic proteins that were aligned against residues of the mesophilic homologues in alpha helix (HA, exposed or Ha, buried), in beta strand (EA, exposed or Ea, buried), in coil (CA, exposed or Ca, buried) or with positive phi angles (PA, exposed or Pa, buried). The environments with the smallest differences between mesophilic and thermophilic proteins are PA and Pa, where the preference for Gly was very high and no large differences were observed for the other amino acids. For this reason the tables for residues with positive phi angles are not shown.

Figure 1 Amino acid composition in percent. Bars in blue are for mesophilic, in green for thermophilic archaeal, and in yellow for thermophilic eubacterial proteins. Dots indicate values that significantly differ (P < 0.01) between thermophilic and mesophilic proteins.
In general, the environments in which we observed significant differences between thermophiles and mesophiles are the exposed ones, in particular exposed coils. In these environments, polar, non-charged amino acids are under-represented in thermophiles, whereas charged amino acids are over-represented.
Ion pairs stabilise proteins at high temperature more strongly than at low temperature [25,26,27] and desolvation energy is lower for exposed charges than for buried ones [28]. We suggest that a large number of exposed charged amino acids can stabilise proteins at high temperatures, because they are able to form extended networks of ion pairs.

Table 2 legend. Mes stands for amino acid composition of mesophilic proteins, t_arc for amino acid composition of thermophilic archaeal proteins and t_eu for amino acid composition of thermophilic eubacterial proteins. HA stands for exposed alpha helices, Ha for non exposed alpha helices, EA for exposed beta strands, Ea for non exposed beta strands, CA for exposed coil and Ca for non exposed coil. The third letter is the standard code for amino acids. Values in bold significantly differ (P < 0.01) between thermophilic and mesophilic proteins.
Below, we report several specific observations. In archaeal alpha helices, we observed a significant increase of Ile accompanied by a decrease of Ala on the exposed surface and a decrease of Leu on the buried surface.
On the surface of beta strands, we noticed that archaea prefer Ile and eubacteria prefer Val, both amino acids being beta branched. Ile is also over-represented on the buried side of beta strands.
There are contradictory reports concerning Pro. Some researchers observed that Pro has an increased occurrence in thermophilic proteins especially in loops [14,29,30]. Others [12,16] found that the frequency of Pro was unchanged. Our data show that the frequency of Pro does not change significantly in general, except for a minor, albeit significant increase in exposed loops of eubacteria.

Environment specific substitution likelihoods
Amino acid composition can be a useful means to identify thermophilic organisms, but a more ambitious goal is to predict which substitutions are likely to change a mesophilic protein to a thermophilic one. The conservation of amino acid residues is strongly dependent on the environment in which they occur in the folded protein. Therefore, we calculated environment specific amino acid substitution likelihoods using a modified version of SUBST (K. Mizuguchi, unpublished). For each environment we calculated 20 × 20 substitution likelihoods. Each value represents the likelihood of occurrence and acceptance of a mutational event of a residue in the mesophilic sequence and in a particular structural environment, leading to any other residue in the thermophilic sequences. We compared these values with those representing the likelihood of occurrence and acceptance of a mutational event of a residue in the mesophilic sequence and in a particular structural environment, leading to any other residue in the mesophilic sequences. We show a list of statistically significant cases, in which the likelihood of a substitution leading from a mesophilic protein to a thermophilic archaeal protein or to a thermophilic eubacterial protein is different from the corresponding environment specific amino acid substitution in mesophilic proteins. For the sake of simplicity, in Table 3 (for archaea) and Table 4 (for eubacteria) we only show cases in which the difference is statistically significant (P < 0.01) and large (|Δ| > 2). All statistically significant cases are also provided in additional files 1 and 2.
As already observed in Table 2, major differences between thermophiles and mesophiles are observed in exposed environments. The substitutions that more frequently lead from mesophilic proteins to thermophilic proteins are those of polar, non-charged amino acids with Glu and Lys (in archaea) or with Arg (in eubacteria). In archaea, we also frequently observe the substitution of non-polar amino acids with Ile. The role of Ile is striking, since more than one third of the substitutions that lead from mesophilic to thermophilic archaeal proteins involve this amino acid. Substitutions of hydrophobic amino acids with Ile are highly frequent, in particular in the environment of exposed alpha helices. Ile is generally preferred to the gamma branched Leu, even in alpha helices, and to the smaller beta branched Val. No hydrophobic amino acid has such prevalence in the case of eubacterial thermophilic proteins. Since the average nucleotide composition does not differ significantly in the genomes of the archaea and eubacteria considered (Table 1), the abundance of Ile cannot be explained only by the fact that it is coded by triplets very rich in A/T (ATA, ATT and ATC).

Conclusion
One reason to study naturally occurring thermostable proteins is to learn how mesophilic proteins of biotechnological interest can be stabilised. In this context, it is reassuring to observe that differences between thermophilic and mesophilic proteins occur primarily in solvent accessible surfaces. This suggests a possible strategy for enhancing the thermal stability of proteins: mutagenesis of exposed residues is usually better tolerated by proteins, whereas mutagenesis of buried residues, even when rationally designed, can often lead to the misfolding of the protein of interest. By calculating the likelihood of substitutions that lead from mesophilic to thermophilic proteins, a simple and potentially useful trend for biotechnology was recognised in archaea, where polar, non-charged amino acids are preferentially substituted by Glu and Lys and non-polar amino acids by Ile.
When we speak of substitutions that lead from mesophilic to thermophilic proteins, we refer only to the fact that we aligned thermophilic proteins to their mesophilic homologues of known structure; we by no means imply that thermophilic proteins are evolutionarily derived from mesophilic proteins (or vice versa). Thermophiles are located at the deepest positions within the phylogenies of both prokaryotic domains. This observation led to the hypothesis of the hot origin of life, but the matter is complex and still disputed [31]. Our data suggest that different strategies for thermal adaptation might have been exploited by archaea and eubacteria.

Methods
Protein sequences for 10 thermophilic archaeal and seven thermophilic eubacterial genomes, as well as their GC content and optimal growth temperatures (OGTs), were obtained from the NCBI genome site [32]. These were the only thermophiles whose genomes had been completed and stored in this database at the time of investigation; we arbitrarily chose one species when two or more organisms belonging to the same genus were available.
The dataset of mesophilic structures was created from HOMSTRAD [19], available at the HOMSTRAD site [33].
Each sequence in the dataset of mesophilic proteins was used as a query to search separately against the databases of thermophilic archaeal or eubacterial sequences. We performed gapped BLASTP searches in PSI-BLAST mode with the BLASTPGP [20] program using the following parameters: j, the maximum number of rounds, was set to 2; h, the e-value threshold for including sequences in the score matrix model, was set to 0.000000001; and e, the final e-value, was set to 0.000001. The same program produced the alignment of the query mesophilic sequence with its thermophilic homologues.

Table 3 legend. Mes stands for mesophilic proteins, t_arc for thermophilic archaeal proteins, HA for exposed alpha helices, Ha for non exposed alpha helices, EA for exposed beta strands, Ea for non exposed beta strands, CA for exposed coil and Ca for non exposed coil. Data are shown only if P < 0.01 in the two-tailed t-test and if the difference between mes and t_arc is, in absolute value, larger than 2. Environment specific amino acid substitutions with higher likelihood values in mesophilic-thermophilic archaeal homologues are in italics; those with higher likelihood values in mesophilic-mesophilic homologues are in bold. Data are sorted by increasing differences between mes and t_arc.
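The search settings given in the Methods (j = 2, h = 0.000000001, e = 0.000001) can be written as a command line for the legacy `blastpgp` program. The sketch below only assembles the argument list; the file and database names are hypothetical placeholders, and the use of `-i`, `-d` and `-o` for query, database and output is an assumption about how the program was invoked, since the source names only the j, h and e parameters.

```python
def blastpgp_command(query_fasta, database, output_path):
    """Build a blastpgp argument list with the thresholds quoted in the text.

    All three arguments are placeholder paths supplied by the caller;
    nothing is executed here.
    """
    return [
        "blastpgp",
        "-i", query_fasta,   # query: one mesophilic sequence of known structure
        "-d", database,      # formatted database of thermophilic sequences
        "-j", "2",           # maximum number of PSI-BLAST rounds
        "-h", "1e-9",        # e-value threshold for inclusion in the score matrix model
        "-e", "1e-6",        # final e-value threshold
        "-o", output_path,   # report containing the alignment to the query
    ]


# Hypothetical file names, for illustration only.
cmd = blastpgp_command("query.fasta", "thermo_archaea", "query_vs_archaea.out")
print(" ".join(cmd))
```

Running the search would require a local installation of the legacy NCBI BLAST package and a formatted database, for example by passing `cmd` to `subprocess.run`.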
The first sequence of each alignment thus produced was a mesophilic protein of known 3D structure and its secondary structure/main chain conformational states and solvent accessibility were calculated by JOY [34]. Residues with side-chain relative accessibility higher than 7% were defined as accessible, otherwise inaccessible.
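As a small illustration of the accessibility rule above, the following sketch combines a main-chain state with the 7% threshold to give a two-letter environment label. The function name and inputs are hypothetical; in the study these quantities were computed by JOY, which is not reproduced here.

```python
CONFORMATIONS = {"H", "E", "P", "C"}  # helix, strand, positive phi, coil


def environment_label(conformation, relative_accessibility, threshold=7.0):
    """Return a label such as 'HA' or 'Ca' for one residue.

    conformation           -- one of 'H', 'E', 'P', 'C'
    relative_accessibility -- side-chain relative accessibility in percent
    A residue is accessible ('A') only if its value is strictly higher
    than the threshold; otherwise it is inaccessible ('a').
    """
    if conformation not in CONFORMATIONS:
        raise ValueError(f"unknown conformation: {conformation!r}")
    exposure = "A" if relative_accessibility > threshold else "a"
    return conformation + exposure


# Illustrative values only.
labels = [environment_label("H", 35.2),
          environment_label("E", 7.0),
          environment_label("C", 0.5)]
```

With these inputs the labels are `HA`, `Ea` and `Ca`; a residue at exactly 7% falls on the inaccessible side because the text requires a value higher than 7%.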
We modified (K. Mizuguchi, unpublished) the program SUBST available at the SUBST site [35], which had been used to derive the environment specific substitution tables for the homology recognition software FUGUE [36]. The modified version of SUBST can now count amino acid substitutions between a protein of known structure and its homologous sequences. Observed amino acid replacements at aligned positions were counted in terms of the local environment of the first sequence (i.e., the mesophilic protein of known 3D structure). Let $F^{EN}_{MT}$ be the number of times the amino acid M of the mesophilic protein in the environment EN was replaced in thermophilic proteins by the amino acid T. The raw substitution counts were converted into substitution frequencies $f^{EN}_{MT}$ as:

$$f^{EN}_{MT} = \frac{F^{EN}_{MT}}{\sum_{t} F^{EN}_{Mt}}$$

E generically refers to secondary structure/main chain conformation; specifically 'H' indicates alpha helices, 'E' beta strands, 'P' residues with a positive phi angle and 'C' coils. N generically refers to solvent accessibility; specifically 'A' indicates accessible side chains and 'a' inaccessible side chains.

Table 4 legend. Mes stands for mesophilic proteins, t_eu for thermophilic eubacterial proteins, HA for exposed alpha helices, Ha for non exposed alpha helices, EA for exposed beta strands, Ea for non exposed beta strands, CA for exposed coil and Ca for non exposed coil. Data are shown only if P < 0.01 in the two-tailed t-test and if the difference between mes and t_eu is, in absolute value, larger than 2. Environment specific amino acid substitutions with higher likelihood values in mesophilic-thermophilic eubacterial homologues are in italics; those with higher likelihood values in mesophilic-mesophilic homologues are in bold. Data are sorted by increasing differences between mes and t_eu.
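The conversion of raw counts into frequencies described in this section amounts to dividing each count by the total for the same environment and the same mesophilic residue. A minimal sketch follows; the dictionary layout and the toy counts are assumptions made for illustration, and this is not the SUBST code.

```python
from collections import defaultdict


def substitution_frequencies(counts):
    """Normalise raw counts within each (environment, mesophilic residue) row.

    counts -- dict mapping (environment, mesophilic aa, thermophilic aa)
              to a raw substitution count
    Returns a dict with the same keys whose values sum to 1 over all
    thermophilic residues for a given environment and mesophilic residue.
    """
    row_totals = defaultdict(int)
    for (env, meso, _thermo), n in counts.items():
        row_totals[(env, meso)] += n
    return {
        key: n / row_totals[key[:2]]
        for key, n in counts.items()
        if row_totals[key[:2]] > 0  # rows with no observations are dropped
    }


# Toy counts (hypothetical, not from the paper).
raw = {
    ("CA", "Q", "E"): 6,
    ("CA", "Q", "K"): 3,
    ("CA", "Q", "Q"): 1,
    ("HA", "A", "I"): 2,
}
freqs = substitution_frequencies(raw)
```

Here the exposed-coil Gln row sums to 10, so its three entries become 0.6, 0.3 and 0.1, and the single exposed-helix Ala entry becomes 1.0.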
Two-tailed t-tests for independent samples were carried out to identify statistically significant (P < 0.01) differences between the values calculated for thermophilic proteins and the reference mesophilic proteins. The alignments built for archaeal proteins were randomly divided into four sets. For each set, environment specific amino acid compositions and substitutions were calculated. The means of these values were calculated to produce the final results. Similarly, the alignments built for eubacterial proteins and the control alignments of mesophilic proteins were each divided into four sets and the mean values of the amino acid compositions/substitutions were calculated. Differences between the means of two groups (e.g., thermophilic archaea and mesophiles) were then tested (with six degrees of freedom).
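The significance test can be sketched with the standard library alone. Two assumptions are made here: that the test is the pooled-variance (Student) form, which is what four values per group and six degrees of freedom imply, and that significance at P < 0.01 is judged against the tabulated two-tailed critical value of 3.707 for six degrees of freedom. The subset values are invented for illustration.

```python
from math import sqrt
from statistics import mean, variance

# Two-tailed critical value of Student's t for 6 degrees of freedom at P = 0.01.
T_CRIT_DF6_P01 = 3.707


def pooled_t(a, b):
    """Return (t statistic, degrees of freedom) for two independent samples.

    Uses the pooled sample variance; raises ZeroDivisionError if both
    samples have zero variance.
    """
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / sqrt(pooled_var * (1 / na + 1 / nb))
    return t, df


# Hypothetical per-subset values of one composition entry (percent).
thermo_subsets = [8.1, 8.3, 7.9, 8.1]
meso_subsets = [6.0, 6.2, 5.8, 6.0]

t_value, df = pooled_t(thermo_subsets, meso_subsets)
significant = df == 6 and abs(t_value) > T_CRIT_DF6_P01
```

With four subsets per group the function returns six degrees of freedom, and the toy values above differ by far more than the critical value, so `significant` is true.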