The canonical genetic code is not universal, although it is present in most complex genomes. How it became established is still under discussion, since the discovery of non-standard genetic codes called the "frozen accident" hypothesis [1] into question. Woese [2] was one of the first to consider the adaptability of the genetic code. He suggested that the patterns within the standard genetic code reflect the physicochemical properties of amino acids. An argument in favor is the fact that, in the canonical genetic code, amino acids with similar chemical properties are coded by similar codons.
There are three basic theories on the origin of the organization of the genetic code [3]. The stereochemical theory claims that the origin of the genetic code must lie in the stereochemical interactions between anticodons or codons and amino acids. The second one is the physicochemical theory, which claims that the force shaping the origin of the genetic code structure was the tendency to reduce the deleterious effects of physicochemical distances between amino acids encoded by codons differing in one base. The third one is the coevolution hypothesis [4, 5], which suggests that the structure of the genetic code reflects the biosynthetic pathways of amino acids through time, and that error minimization at the protein level is just a consequence of this process. This coevolution theory suggests that codons, originally assigned to prebiotic precursor amino acids, were progressively assigned to new amino acids derived from the precursors as biosynthetic pathways evolved. For other authors, such as Higgs [6], the driving force during the build-up of the standard code is not the minimization of the effects of translational error; the main factor that influenced the positions at which new amino acids were added is that there should be minimal disruption of the protein sequences that were already encoded. Nevertheless, the code that results is one in which translational error is minimized.
Several previous works have studied the genetic code optimality. Most authors have quantified the efficiency of a possible code taking into account the possible errors in the codon bases. Generally, a measurement of the change in a basic property of the encoded amino acids was used, considering all possible mutations in a generated code. The most efficient code is one that minimizes the effects of mutations, as this minimization implies a smaller phenotypic change in the encoded proteins.
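To make this kind of measurement concrete, the following minimal sketch computes a mean-square change of an amino acid property (for instance, Woese's polar requirement) over all single-base codon changes. It is an illustration under simple assumptions rather than the exact weighting scheme of any cited work: `code` is assumed to be a dictionary mapping the 64 codons to amino acids (or "Stop"), `prop` a dictionary of property values per amino acid, and changes involving stop codons are simply skipped.

```python
BASES = "UCAG"

def single_base_neighbors(codon):
    """All codons that differ from `codon` in exactly one position."""
    for pos in range(3):
        for base in BASES:
            if base != codon[pos]:
                yield codon[:pos] + base + codon[pos + 1:]

def ms_error(code, prop):
    """Mean squared change of an amino acid property (e.g. polar
    requirement) over all single-base codon changes.

    code : dict mapping each of the 64 codons to an amino acid or "Stop"
    prop : dict mapping each amino acid to a numeric property value
    Changes to or from stop codons are ignored in this simplified sketch.
    """
    total, count = 0.0, 0
    for codon in code:
        if code[codon] == "Stop":
            continue
        for neighbor in single_base_neighbors(codon):
            if code[neighbor] == "Stop":
                continue
            diff = prop[code[codon]] - prop[code[neighbor]]
            total += diff * diff
            count += 1
    return total / count
```

Under this measure, a lower returned value corresponds to a more efficient code.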
Once the efficiency of a code has been measured, different criteria are used to assess whether the genetic code is in some sense optimal. These analyses fall into two main classes: statistical [7] and engineering [8]. The first one considers the probability that a random code is more efficient than the standard genetic code. With this alternative for measuring code optimality, the standard genetic code is compared with many randomly generated alternative codes. These considerations define the so-called "statistical approach" [7]. Comparing the error values of the standard genetic code and the alternative codes indicates, according to the authors using this approach [9–13], the role of selection. The main conclusion of these authors is that the genetic code conserves amino acid properties far better than expected from a random code.
In a first computational experiment with this alternative, Haig and Hurst [12] corroborated that the canonical code is optimized to a certain extent. They found that of 10,000 randomly generated codes, only two performed better at minimizing the effects of errors, when polar requirement [2] was taken as the amino acid property, concluding that the canonical code was a product of natural selection for load minimization. Freeland and Hurst [9] investigated the effect of weighting transition errors differently from transversion errors and the effect of weighting each base differently, depending on reported mistranslation biases. When they used weightings to allow for biases in translation, they found that only one in every million randomly generated alternative codes was more efficient than the standard genetic code.
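As a hedged sketch of this statistical procedure, assuming the `ms_error` function above and a `random_code` generator such as the model 1 permutation sketched later in this section, the comparison reduces to counting how many sampled codes score below the standard one:

```python
import random

def fraction_better(standard_code, prop, n_codes=10000, seed=0):
    """Fraction of randomly generated alternative codes whose error value
    is lower (better) than that of the standard genetic code."""
    rng = random.Random(seed)
    standard_error = ms_error(standard_code, prop)
    better = sum(
        ms_error(random_code(standard_code, rng), prop) < standard_error
        for _ in range(n_codes)
    )
    return better / n_codes
```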
With a similar methodology, Gilis et al. [14] took into account the frequency at which different amino acids occur in proteins and found that the fraction of random codes that beat the canonical code decreases. Torabi et al. [15] considered both the relative frequencies of amino acids and the relative gene copy frequencies of tRNAs in genomic sequences, which were used as estimates of the tRNA content [16]. Zhu et al. [17] included codon usage differences between species, and Marquez et al. [18] tested the idea that organisms optimize their codon usage as well as their genetic code: codons with lower error values might be used in preference to those with higher error values, to reduce the overall probability of errors, although they reached the opposite conclusion.
Sammet et al. [19], using a genotype-to-phenotype mapping based on a quantitative model of protein folding, compared the standard genetic code to seven of its naturally occurring variants with respect to the fitness loss associated with mistranslation and mutation. According to the authors' methodology, most of the alternative genetic codes were found to be at a disadvantage with respect to the standard code; that is, the standard code is generally better able to reduce the translation load than the naturally occurring variants.
The second alternative for measuring code optimality is the so-called "engineering approach", followed, for example, by Di Giulio [8, 20]. The approach uses a "percentage distance minimization" (p.d.m.) which compares the standard genetic code with the best possible alternative. The p.d.m. determines code optimality on a linear scale, as it expresses where the canonical genetic code lies between the randomized mean code and the most optimized code. Therefore, it is defined as (Δmean − Δcode)/(Δmean − Δlow), where Δcode is the error value of the canonical code, Δmean is the average error value, obtained by averaging over many random codes, and Δlow is the best (or best approximated) Δ value. This approach tends to indicate that the genetic code is still far from optimal.
With this methodology, Di Giulio [21] estimated that the standard genetic code has achieved 68% minimization of polarity distance, by comparing the standard code with random codes that reflect the structure of the canonical code and with the best code that the author obtained by a simulated annealing technique. The author indicates that this minimization percentage can be interpreted as the optimization level reached during genetic code evolution. Using the same methodology, the authors of [22] also considered the evolution of the code under the coevolution theory. We previously analyzed the evolution of codes within the coevolution theory [23].
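Because Δlow in [21] was approximated with simulated annealing, the following is a minimal, hedged sketch of such a search over codes that keep the block structure fixed and swap the amino acids assigned to two blocks. The cooling schedule, the move set, and the reuse of the `ms_error` sketch in place of the polarity-distance measure are illustrative assumptions, not the procedure of the cited work.

```python
import math
import random

def anneal_best_code(code, prop, steps=100000, t0=1.0, alpha=0.99995, seed=0):
    """Approximate the lowest error value reachable by repeatedly swapping
    the amino acids assigned to two codon blocks (block structure fixed).
    Returns the best error found, an estimate of Δlow."""
    rng = random.Random(seed)
    amino_acids = sorted({aa for aa in code.values() if aa != "Stop"})
    current = dict(code)
    current_err = best_err = ms_error(current, prop)
    temperature = t0
    for _ in range(steps):
        a, b = rng.sample(amino_acids, 2)
        # Candidate code: exchange the blocks of amino acids a and b.
        candidate = {c: (b if aa == a else a if aa == b else aa)
                     for c, aa in current.items()}
        cand_err = ms_error(candidate, prop)
        # Always accept improvements; accept worse moves with
        # Boltzmann probability that shrinks as the temperature drops.
        if (cand_err < current_err or
                rng.random() < math.exp((current_err - cand_err) / temperature)):
            current, current_err = candidate, cand_err
            best_err = min(best_err, current_err)
        temperature *= alpha
    return best_err
```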
We used the mean square (MS) measurement [9, 12] (Methods Section) to quantify the relative efficiency of any given code. We considered two possibilities for generating alternative codes: the first one is the model of hypothetical codes that reflects the current genetic code translation table (model 1), which is the one most used in the literature. Two restrictions were considered [9, 12]:
1. The codon space (the 64 codons) was divided into the 21 nonoverlapping sets of codons observed in the standard genetic code, each set comprising all codons specifying a particular amino acid in the standard code.
2. Each alternative code is formed by randomly assigning each of the 20 amino acids to one of these sets. The three stop codons remain invariant in position, these being the same stop codons of the standard code.
This choice of a small part of the vast space of possible codes, with these conservative restrictions, as Novozhilov et al. [24] indicate, "is based on the notion that the block structure of the standard code is a consequence of the structure of the complex between the cognate tRNA and the codon in mRNA where the third base of the codon plays a minimum role as a specificity determinant".
As the codon set structure of the standard genetic code is unchanged, only considering permutations of the amino acids coded in the 20 sets, there are 20! (≈2.43·10^18) possible hypothetical codes. Without restrictions in the mapping of the 64 codons to the 21 labels there would be more than 1.51·10^84 general codes [25].
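A minimal sketch of generating one such model 1 code under the two restrictions above might look as follows; `standard_code` is assumed to be a dictionary mapping all 64 codons to their standard amino acid or to "Stop".

```python
import random

def random_code(standard_code, rng=random):
    """Model 1 alternative code: keep the codon block structure of the
    standard code, permute the 20 amino acids among the 20 blocks, and
    leave the three stop codons untouched."""
    amino_acids = sorted({aa for aa in standard_code.values() if aa != "Stop"})
    shuffled = amino_acids[:]
    rng.shuffle(shuffled)
    relabel = dict(zip(amino_acids, shuffled))
    return {codon: ("Stop" if aa == "Stop" else relabel[aa])
            for codon, aa in standard_code.items()}
```

Since there are 20! ≈ 2.43·10^18 distinct relabellings, random sampling is used instead of exhaustive enumeration.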
In this work we considered the restricted codes described above. Nevertheless, as Higgs [6] indicates, none of the known examples of codon reassignment occurs by swapping the amino acids assigned to two codon blocks. Instead, one or more codons assigned to one amino acid are reassigned to another, so one block of codons decreases in size while the other increases. Furthermore, the amino acid that acquires the codon is almost always a neighbor of the one that loses it. As Higgs [6] states, "The reason for this is that reassignments of codons to neighbouring amino acids can be done by changing only a single base in the tRNA anticodon". Hence, we also studied a second alternative with restricted hypothetical codes that consider these codon reassignments (model 2), a model not considered in the previous literature.
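As an illustration of model 2, the sketch below performs a single reassignment move: one codon is transferred from the amino acid that loses it to a neighbouring amino acid, i.e. one already encoded by a codon differing in a single base (reusing the `single_base_neighbors` helper from the earlier sketch). This is one hedged interpretation of the restriction described above, not necessarily the exact generator used in this work.

```python
import random

def reassign_one_codon(code, rng=random):
    """Model 2 move: pick a codon at random and reassign it to an amino
    acid already encoded by one of its single-base neighbours, so one
    block shrinks by one codon while a neighbouring block grows."""
    codons = [c for c, aa in code.items() if aa != "Stop"]
    rng.shuffle(codons)
    for codon in codons:
        # Keep every amino acid encoded by at least one codon.
        block_size = sum(1 for aa in code.values() if aa == code[codon])
        if block_size < 2:
            continue
        neighbor_aas = {code[n] for n in single_base_neighbors(codon)
                        if code[n] not in ("Stop", code[codon])}
        if neighbor_aas:
            new_code = dict(code)
            new_code[codon] = rng.choice(sorted(neighbor_aas))
            return new_code
    return dict(code)  # no valid move found; return an unchanged copy
```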