The high percentage of the variance explained by the [1,1,1] factor combination shows that the main determinant of the amino acid composition of proteins is independent of protein function and of the organism to which the proteins belong. The different uses of amino acids may therefore be due to differences in several biochemical characteristics. However, which amino acid properties influence their usage in proteins is still unknown. Moreover, Jordan et al. have shown that the amino acid composition of proteins is not in equilibrium. By comparing sets of orthologous proteins of closely related genomes from 15 species representing the three domains of life and comparing the fluxes of reciprocal substitutions caused by single-nucleotide replacements, these authors found that cysteine, methionine, histidine, serine and phenylalanine are strong 'gainers' (i.e. their frequency is increasing), and proline, alanine, glutamate and glycine are strong 'losers' (i.e. their frequency is decreasing). Except for methionine, gainers tend to be under-represented and losers over-represented. This loser-rich and gainer-poor amino acid composition may be due to the order in which amino acids were recruited into the genetic code. The correlation between the general amino acid frequencies that we observe (the projection of the amino acids onto the x axis in figure 1) and the rate of gain or loss defined by Jordan et al. is only -0.39. The correlation, however, is -0.68 when we compare the general amino acid frequencies and a consensus chronology of incorporation of amino acids into the genetic code defined by Trifonov. This relatively high correlation value suggests that the order of recruitment of the amino acids into the genetic code may be an additional factor that influences the different use of the amino acids.
However, because Trifonov used the amino acid composition of extant proteins as one of the 60 criteria to obtain his consensus chronology of amino acids, the above correlation is not unexpected and must be interpreted with caution.
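The correlations quoted above compare one score per amino acid (its projection onto the x axis) with a second per-amino-acid quantity (rate of gain or loss, or rank in the recruitment chronology). The text does not state which correlation coefficient was used; the sketch below assumes a Pearson coefficient. The function name `pearson_r` and the toy vectors are hypothetical placeholders, not values from this study.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Placeholder values for illustration only (NOT data from the study):
# one frequency score and one chronology rank per amino acid.
toy_frequency_score = [0.9, 0.7, 0.4, 0.2, 0.1]
toy_recruitment_rank = [1, 2, 3, 4, 5]

r = pearson_r(toy_frequency_score, toy_recruitment_rank)
print(f"r = {r:.2f}")
```

In this toy input the score falls as the rank number rises, so the coefficient is negative. If the chronology is treated as purely ordinal, a rank-based coefficient would be a reasonable alternative.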
In addition to this general amino acid composition, there are obviously differences in the amino acid composition of proteins due to the function or organism to which they belong. The difference between ribosomal and non-ribosomal proteins is the main factor behind the amino acid usage within species in the data set we analyzed. Shape and charge complementarity rather than sequence-specific interactions are responsible for the specific interactions of most ribosomal proteins with RNA. Because of these interactions, ribosomal proteins prefer positively charged amino acids and avoid negatively charged ones. The mapping of conserved arginines and lysines onto the ribosome structure has revealed that these charged residues frequently form surface patches that reflect RNA-binding sites. The ribosomal proteins L10, L29, S2 and S14, however, do not cluster with the other ribosomal proteins in figure 2. S2 and L10 appear in the non-ribosomal cluster, and S14 and L29 have a slightly different amino acid composition from the other proteins (see figure 2). These differences may be due to the position or role of these proteins once the ribosome is formed. Although S2 is one of the largest ribosomal proteins in the 30S subunit, it is very loosely attached to this subunit (only seven out of 236 residues contact the rRNA) and has the lowest percentage of arginine and lysine in the subunit. It is not surprising, therefore, that S2 clusters with non-ribosomal proteins. With approximately 60 amino acids, S14 and L29 are the smallest ribosomal proteins in the data set. The short sequence of these proteins influences their amino acid composition and both appear as outliers in figure 2. However, projection of the S14 protein onto the horizontal axis of figure 2 shows that, despite its short length, this protein has some characteristics of the majority of ribosomal proteins.
S14 is completely devoid of any globular domain, and most of the protein has an extended coil structure. Although S14 is involved in intimate protein-protein interactions, almost its entire length is involved in RNA contacts and its arginine and lysine contents are similar to those of most ribosomal proteins. S14 is therefore indistinguishable from most ribosomal proteins in the x-axis projection of figure 2. On the other hand, L29 interacts with the L23 protein and with only one of the six domains of 23S rRNA. This characteristic, and its short length, may therefore explain the position of L29 in figure 2.
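The argument about S2 and S14 rests on the share of arginine and lysine in each protein. As a minimal sketch, that share can be computed from a one-letter sequence as follows. The helper `residue_fraction` and the ten-residue peptide are hypothetical and serve only as an illustration; they are not sequences from the data set.

```python
def residue_fraction(sequence, residues):
    """Fraction of positions in a one-letter protein sequence
    whose residue belongs to the given set."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("empty sequence")
    return sum(1 for aa in seq if aa in residues) / len(seq)


BASIC = frozenset("KR")   # lysine, arginine (positively charged)
ACIDIC = frozenset("DE")  # aspartate, glutamate (negatively charged)

# Made-up peptide for illustration only, not a real ribosomal protein.
toy_protein = "MKRKAEGRKL"

basic = residue_fraction(toy_protein, BASIC)
acidic = residue_fraction(toy_protein, ACIDIC)
print(f"K+R: {basic:.2f}  D+E: {acidic:.2f}")
```

For very short proteins such as S14 and L29, a single residue changes these fractions noticeably, which is one reason short sequences can appear as compositional outliers.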
G+C content and optimal growth temperature are the two factors that most influence differences in amino acid composition between organisms. Analysis of the optimal temperatures of the enzymes extracted from hyperthermophilic organisms showed that thermal resistance was an intrinsic property of these enzymes. Comparative analysis of the amino acid composition of orthologous proteins from several mesophilic and thermophilic organisms indicated some amino acid substitutions that are preferred in thermophiles. However, the small number of sequences analyzed and the fact that factors other than temperature can affect the amino acid composition of proteins revealed the inconsistency of these results. Comparison of the first completely sequenced genomes of several thermophiles and mesophiles showed that proteins from thermophiles contain higher levels of both charged and hydrophobic residues and lower levels of polar and uncharged ones. Once more complete genomes were sequenced, new analyses were performed using different methods and different datasets [8–10, 21–25]. Although these studies show several discrepancies in the role of each amino acid, there is a consensus that glutamate (E) and, to a lesser extent, valine (V) are the amino acids that are more represented in thermophiles than in mesophiles. These were also the amino acids that were most represented in thermophiles when our method was used.
There are greater discrepancies, however, over which amino acids are used with the lowest frequency in thermophiles or with the highest frequency in mesophiles. For example, Singer and Hickey found that these amino acids were A, H, Q and T; Kreil and Ouzounis found that they were Q and T; and Tekaia and coworkers found only Q. These discrepancies indicate that hyperthermophilic and mesophilic enzymes may be very similar, the difference being that hyperthermophilic enzymes are more rigid than mesophilic enzymes. To increase their rigidity, hyperthermophilic enzymes may adopt several strategies, but a common rule could be that more charged residues are found in hyperthermophilic proteins, mostly at the expense of uncharged polar residues. Computational, biochemical, and structural evidence now supports the hypothesis that ion pair formation, hydrogen bonds, and hydration, rather than hydrophobic interactions, play important roles in the stabilization of enzymes from extremophiles. Also, we cannot talk of a common amino acid usage in mesophiles because no specific adaptation is needed to live at intermediate temperatures. When comparing the amino acid compositions of thermophilic and mesophilic proteins, therefore, different datasets and methods obtain different results.
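The proposed common rule (more charged residues at the expense of uncharged polar ones) can be checked on any two sets of protein sequences by pooling residue counts per set. The sketch below is one way to do this, under assumptions: the class definitions (charged = D, E, K, R; uncharged polar = N, Q, S, T) are one common convention and not necessarily the one used in the cited studies, and the function name and toy sequences are hypothetical.

```python
from collections import Counter

# Assumed residue classes; histidine and other borderline residues
# are deliberately left out of both sets.
CHARGED = frozenset("DEKR")
POLAR_UNCHARGED = frozenset("NQST")


def class_fractions(proteins):
    """Pooled fractions of charged and uncharged polar residues
    over an iterable of one-letter protein sequences."""
    counts = Counter()
    for seq in proteins:
        counts.update(seq.upper())
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no residues to count")
    charged = sum(counts[aa] for aa in CHARGED) / total
    polar = sum(counts[aa] for aa in POLAR_UNCHARGED) / total
    return charged, polar


# Made-up sequences for illustration only (not real proteomes).
toy_set_a = ["MEEKRVLE", "KEVDRI"]
toy_set_b = ["MQTSNALQ", "STQAGN"]

charged_a, polar_a = class_fractions(toy_set_a)
charged_b, polar_b = class_fractions(toy_set_b)
print(f"set A: charged={charged_a:.2f} polar={polar_a:.2f}")
print(f"set B: charged={charged_b:.2f} polar={polar_b:.2f}")
```

Because the result depends on how the residue classes are defined and on which proteins are pooled, different datasets and conventions can give different rankings of individual amino acids.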
The use of certain amino acids with higher or lower frequencies in thermophiles is important for the thermal stability of their enzymes. However, other factors may contribute to survival at high temperatures. Thermophilic archaea, for example, may be protected by their unique membrane lipids, the use of a reverse gyrase that introduces positive supercoils, a DNA repair system [28, 29] and the presence of special DNA-binding proteins. These thermophile-specific proteins may include highly basic histone-like proteins that wind and compact DNA into a nucleosome-like structure and thus protect it from heat denaturation. Loss of some of these factors may lead to a lesser ability to grow at high temperatures. This could be the case for the Euryarchaeota Halobacterium sp. (Hbs) and M. acetivorans (Mac), two archaea whose optimal growth temperature is below 40°C but that cluster with other thermophilic species in figure 3. The amino acid compositions of these two Euryarchaeota, which are very similar to those of other thermophiles, may be a trace of their past ability to grow at high temperatures. A thermophile-specific NTPase found in 13 thermophilic genomes and absent in 52 mesophilic genomes is present in M. acetivorans. This suggests that M. acetivorans could be facultatively thermophilic. Although the phylogenetic position of these two archaea and our analysis of the amino acid composition suggest a recent transition to mesophily in Halobacterium sp. and M. acetivorans, this hypothesis is speculative and needs to be supported by stronger evidence. In this sense, it would be useful to identify proteins present in all thermophilic Euryarchaeota but not in mesophilic Euryarchaeota. One of these proteins is a dsDNA-binding protein called Alba (short for "acetylation lowers binding affinity"), which is present in several thermophilic archaea but not in Halobacterium sp. or M. acetivorans.
The correlation of Alba with growth at high temperatures hints at a role for Alba in DNA protection and stability under these conditions. Interestingly, it has been suggested that this protein constrains negative DNA supercoils in a temperature-dependent fashion, which suggests that it may function in chromosomal organization and accessibility.
The relationship between genomic G+C content and optimal growth temperature in prokaryotes has been debated recently in the literature [34–37]. Because G:C pairs in DNA are more thermally stable than A:T pairs, it has been suggested that a high G+C content may be a selective response to high temperature. In this sense, a significant correlation has been observed between optimal growth temperature and the G+C content of structural RNAs [35, 36]. When open reading frames are analyzed, some studies have concluded that there is no correlation between G+C content and optimal growth temperature [34–36] and others have found a positive correlation among some families of prokaryotes. If this correlation exists, it could be argued that the G+C-content dependence observed in the amino acid composition of prokaryotes is a consequence of their dependence on thermophily. In our dataset, the G+C content at the third codon position and the optimal growth temperature do not correlate significantly (r = 0.081). In addition, the results obtained with the tucker3 algorithm indicate that these two variables are independent. The amino acid variation associated with G+C content and optimal growth temperature corresponds to the second and fifth factors of the amino acid loadings matrix, respectively. Because the principal components obtained with the tucker3 model were constrained to be orthogonal, it can be concluded that the two factors are independent. Accordingly, the correlation observed between the second and fifth factors of the amino acid loadings matrix is only 6.03E-5. Similar arguments can be applied to the second and third factors (those associated with differences in G+C content and optimal growth temperature, respectively) of the organism loadings matrix. Moreover, the amino acid preferred by thermophiles is glutamate, which is encoded by codons of intermediate G+C content (GA[A,G]).
All this evidence suggests that the amino acid variations related to variations of G+C content and optimal growth temperature are independent and that the observed G+C-dependence is not a consequence of a thermophily dependence.
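Two quantities used in this argument are simple to compute: the G+C content at the third codon position of a coding sequence, and the G+C content of the two glutamate codons (GAA and GAG in the standard genetic code). The sketch below illustrates both. The function names and the toy open reading frame are hypothetical and are not taken from the analysis pipeline of this study.

```python
def gc_fraction(seq):
    """Fraction of G or C bases in a nucleotide sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return sum(1 for base in seq if base in "GC") / len(seq)


def gc3_fraction(orf):
    """G+C fraction at third codon positions of an in-frame coding
    sequence; a trailing incomplete codon contributes nothing."""
    third_positions = orf.upper()[2::3]
    return gc_fraction(third_positions)


# Glutamate codons in the standard genetic code.
GLUTAMATE_CODONS = ("GAA", "GAG")
glu_gc = [gc_fraction(codon) for codon in GLUTAMATE_CODONS]
mean_glu_gc = sum(glu_gc) / len(glu_gc)

# Made-up reading frame for illustration only: ATG GAA GAG TAA.
toy_orf = "ATGGAAGAGTAA"
toy_gc3 = gc3_fraction(toy_orf)

print(f"glutamate codon G+C: {glu_gc}, mean {mean_glu_gc:.2f}")
print(f"toy GC3: {toy_gc3:.2f}")
```

The glutamate codons contain one or two G/C bases out of three, so their mean G+C content is one half. This is the sense in which glutamate is neither a typical high-G+C nor a typical low-G+C amino acid.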