Mining prokaryotic genomes for unknown amino acids: a stop-codon-based approach
© Fujita et al. 2007
Received: 13 March 2007
Accepted: 28 June 2007
Published: 28 June 2007
Skip to main content
© Fujita et al. 2007
Received: 13 March 2007
Accepted: 28 June 2007
Published: 28 June 2007
Selenocysteine and pyrrolysine are the 21st and 22nd amino acids, which are genetically encoded by stop codons. Since a number of microbial genomes have been completely sequenced to date, it is tempting to ask whether the 23rd amino acid is left undiscovered in these genomes. Recently, a computational study addressed this question and reported that no tRNA gene for unknown amino acid was found in genome sequences available. However, performance of the tRNA prediction program on an unknown tRNA family, which may have atypical sequence and structure, is unclear, thereby rendering their result inconclusive. A protein-level study will provide independent insight into the novel amino acid.
Assuming that the 23rd amino acid is also encoded by a stop codon, we systematically predicted proteins that contain stop-codon-encoded amino acids from 191 prokaryotic genomes. Since our prediction method relies only on the conservation patterns of primary sequences, it also provides an opportunity to search novel selenoproteins and other readthrough proteins. It successfully recovered many of currently known selenoproteins and pyrrolysine proteins. However, no promising candidate for the 23rd amino acid was detected, and only one novel selenoprotein was predicted.
Our result suggests that the unknown amino acid encoded by stop codons does not exist, or its phylogenetic distribution is rather limited, which is in agreement with the previous study on tRNA. The method described here can be used in future studies to explore novel readthrough events from complete genomes, which are rapidly growing.
Stop codon readthrough is a phenomenon in which the translation process does not terminate at a stop codon, and an amino acid is inserted there instead [1, 2]. In some cases, the inserted amino acid is not one of the 20 amino acids but a noncanonical one. Two such amino acids have been discovered to date: selenocysteine [3, 4] and pyrrolysine [5, 6]. Because each of them have specialized tRNA genes for decoding and can be considered extensions of the standard genetic code, they are called the 21st and 22nd amino acids, respectively. Selenocysteine, the 21st amino acid, is encoded by stop codon UGA, and organisms that use selenocysteine have been found from all three domains of life. Its insertion into UGA is directed by SECIS (selenocysteine insertion sequence) elements, a stem-loop structure on the selenoprotein mRNA. Along with the progress of genome sequencing projects, computational prediction methods of selenocysteine-containing proteins (selenoproteins) have been developed by several research groups [7–10], and the repertoire of selenoproteins has been greatly expanded [11, 12]. Pyrrolysine, the 22nd amino acid encoded by stop codon UAG, was recently discovered from a methanogenic archaea [5, 6]. Currently, only methanogenic archaea of the order Methanosarcinales and one bacterium are considered to utilize pyrrolysine . The limited phylogenetic distribution of pyrrolysine suggests that its incorporation into the genetic code of methanogen is relatively recent, and the insertion mechanism of a novel amino acid can evolve in a shorter period of time than anticipated.
This raises an interesting question: "Is there a 23rd amino acid?" If such an amino acid is discovered, it will deepen our understanding of the evolution and diversity of the genetic code. Because genome sequences of various prokaryotes are available today, there will be a chance to discover the novel amino acid via analysis of these genomes. Since both the 21st and 22nd amino acids are encoded by stop codons, the prime suspect is other stop codons (e.g. stop codon UAA), although the possibility of sense codons certainly remains. Using this clue, computational screening methods of the 23rd amino acid can be designed. Recently, Lobanov et al. addressed this problem by searching tRNAs with anticodons corresponding to stop codons . They analyzed 146 prokaryotic genomes, but no likely tRNA of the novel amino acid was detected. They concluded that the 23rd amino acid would have a limited phylogenetic distribution, if it exists.
However, programs for tRNA identification are based on the features of known tRNAs and do not necessarily perform well on unknown ones. Actually, tRNASec and tRNAPyl have unusual secondary structures [5, 15] and often escape detection by programs without special consideration. Lobanov et al. thus developed a sensitive search method to deal with this problem, but they also admitted that it would fail to identify highly unusual tRNAs. There is another approach to searching for the 23rd amino acid. By enumerating ORFs that have an inframe stop codon from genomes and examining their evolutionary conservation, candidate proteins can be predicted. Because such an ORF-based study is independent from the tRNA analysis, it can either identify candidate organisms missed by the previous study or strengthen its negative conclusion.
Here we report a comprehensive analysis of prokaryotic ORFs that contain an inframe stop codon. Through enumeration of theoretical ORFs and inspection of their evolutionary conservation, candidates of readthrough proteins were predicted. They contained many of the known proteins with stop-codon-encoded amino acids, but almost no novel candidates were identified. Therefore, the unknown amino acid, if it is encoded by a stop codon, is unlikely to exist in the current databases of microbial genomes. The consequences for selenoproteins and other readthrough genes are also discussed.
To discriminate them from true readthrough genes, evolutionary information was exploited. In order to eliminate pseudogenes and sequencing errors, conservation of iORFs and their inframe stop codons was examined. Since pseudogenes are less conserved, and sequencing errors are relatively rare events, they will not have homologous iORFs in other species. Even if they do, the position or type (UAA, UAG or UGA) of their inframe stop codons will not coincide. In this way, they can be eliminated as candidates. A drawback of this criterion is that it limits the target of our study to readthrough genes conserved across two or more species. In other words, species-specific readthrough genes are not in the scope of this study.
To address the second problem, adjacent gene pairs were filtered out by examining boundaries of sequence alignments between iORFs and its homologs (Figure 1b). The stop-codon-encoded amino acids of prokaryotes are usually located inside domains, the units of evolutionary sequence conservation. Therefore, the aligned regions of readthrough proteins contain their inframe stop codon. Based on this observation, each iORF was required to have: (i) at least one homolog from other organisms that covers the inframe stop codon and (ii) no homolog that does not cover the stop codon. Note that, however, if the whole length of an iORF was used as a query sequence, this procedure will erroneously discard multidomain readthrough proteins. To avoid this problem, a partial sequence around the inframe stop codon was used as a query.
As noted above, the target of this study is evolutionarily conserved iORFs. Thus, it was examined whether the iORFs have homologous regions in other genomes. The 390,926 iORFs were translated into amino acid sequences and subjected to TBLASTN  against the 191 genome sequences. Instead of the whole length of the amino acid sequence, a window of 101 residues centered at the inframe stop codon was used as a BLAST query. After the BLAST searches, iORFs that have at least one interspecific hit that contains the inframe stop codon were collected. Whether the codon aligned to the inframe stop codon is a nonsense codon or not was neglected at this stage. There were 94,690 iORFs that have interspecific hits. The result of the above homology searches was also used for the boundary analysis (Figure 1b). An iORF was discarded if there were any BLAST hits that do not cover the inframe stop codon. A total of 26,003 iORF satisfied the above criteria.
To examine intrafamily conservation of the inframe stop codons, these iORFs were clustered into protein families based on sequence similarity. After removal of singletons, 679 clusters with two or more members were obtained. A cluster was discarded unless all members of the cluster had the same type of inframe stop codons (UAA, UAG or UGA). The locations of the inframe stop codons were also required to be identical in the multiple sequence alignment of the cluster members. These conditions reduced the number of clusters to 273.
Manual inspection of these 273 clusters revealed that they still contain many false positives that are unrelated to stop-codon-encoded amino acids. Hence, three-step filtering procedures were applied to remove the false positives. Briefly, the first filter assesses protein-likeliness based on the signal of purifying selection, while the second and third filters try to remove adjacent gene pairs using the pattern of BLAST alignments (for details, see Materials and Methods). As a result of the filtering, the number of candidate clusters was reduced to 32. Through manual inspection of the BLAST alignments, 11 clusters were discarded because they are highly unlikely to code readthrough proteins.
Predicted clusters of readthrough proteins
Example organism (locus)
Formate dehydrogenase α subunit
Escherichia coli (b1474)
Selenide water dikinase
Haemophilus influenzae (HI0200m)
Glycine reductase complex selenoprotein A
Treponema denticola (TDE0745)
Glycine reductase complex selenoprotein B
Treponema denticola (TDE0078)
Heterodisulfide reductase subunit A
Methanococcus jannaschii (MJ1190m)
Coenzyme F420-reducing hydrogenase δ subunit
Methanococcus jannaschii (MJ1190a)
Formylmethanofuran dehydrogenase subunit B
Methanococcus jannaschii (MJ1194m)
Carboxydothermus hydrogenoformans (CHY_0740)
Geobacter sulfurreducens (GSU3446)
Coenzyme F420-reducing hydrogenase α subunit
Methanococcus jannaschii (MJ0029)
Desulfovibrio vulgaris (DVU_1382)
Methanococcus maripaludis (MMP0252 + upstream)
Desulfotalea psychrophila (DP1009)
Desulfovibrio desulfuricans (Dde_1263 + upstream)
Periplasmic [NiFeSe] hydrogenase large subunit
Desulfovibrio vulgaris (DVU_1918)
Methanosarcina acetivorans (MA0144)
Methanosarcina acetivorans (MA0532)
Methanosarcina acetivorans (MA0528)
Transcriptional regulator, TetR family
Methanosarcina acetivorans (MA2902)
Cytochrome c family protein
Geobacter sulfurreducens (GSU2937 + GSU2936)
Geobacter sulfurreducens (GSU2293 + downstream)
Of the four pyrrolysine proteins detected, three methylamine methyltransferases have been experimentally confirmed to contain pyrrolysine [6, 17]. The rest is a cluster of TetR-like transcriptional regulators from Methanosarcina acetivorans and M. barkeri. Since the genome annotation of M. acetivorans describes this protein as a gene containing an inframe amber codon, we classified it as a 'known' candidate, although it is still unclear whether it really contains pyrrolysine. The genome annotation of M. acetivorans also includes several amber-containing genes that were absent from our prediction result. They are a methlycobamide:CoM methylase and four transposases . The reason why they were not detected is that only one species in our dataset had an amber-containing form of these proteins. This is unavoidable because of the inability of our method to detect species-specific readthrough events. It is the price for reliably excluding pseudogenes and sequencing errors.
We expect that this cluster may represent a novel selenoprotein family. This is because the inframe stop codons of these proteins are exclusively TGA, and all of the above organisms possess selenocysteine insertion machinery (data not shown). High conservation of residues near the inframe stop codon also suggests the importance of this region. If they are true selenoproteins, this protein family becomes a rare instance of selenoprotein that lacks non-selenocysteine homologs. However, computational analysis of sequences immediately downstream of the inframe stop codons failed to identify SECIS elements, which is a hallmark of selenocysteine-containing genes. Therefore, yet another possibility is that they are a highly conserved operon. An experimental verification is necessary to distinguish these two possibilities.
The second cluster consists of two hypothetical proteins, again from G. sulfurreducens and G. metallireducens (Figure 4b). In contrast to the first cluster, no homolog was identified from other species. This cluster is probably a false positive and not readthrough proteins. This is because the residues near the inframe stop codons are poorly conserved. Moreover, the C-terminal extensions are quite short (about 20 aa). The sequence conservation in this region can be easily explained by the close phylogenetic relationship between the two species. In summary, although a possible selenoprotein was newly identified, there was no promising candidate for an unknown amino acid encoded by a stop codon.
The above negative result could be explained if the filtering process, which is the final step of the prediction method (Figure 2), was too strict. Although the raw output of the search for evolutionarily conserved iORFs was 273 clusters, most of them were discarded at the subsequent filtering stage. Because we have no a priori knowledge about the 23rd amino acid, cutoff thresholds for the filtering procedures were determined based on the known readthrough proteins. This is practically indispensable for objective classification of candidates, but there is no guarantee that unknown proteins with the 23rd amino acid will score higher than the thresholds.
When top ten organisms were examined, only Gluconobacter oxydans was an organism not known to have stop-codon-encoded amino acids. An inspection of the G. oxydans iORFs in the 273 clusters revealed that their inframe stop codons are dominated by TAA, but all of them belong to a single protein cluster associated with transposable elements. Because it seems unlikely that an insertion system of novel amino acid evolves solely for transposable elements, this organism cannot be considered as a good candidate of the 23rd amino acid. Sensitivity of this test is not high because many organisms that utilize selenocysteine were below the defined threshold. However, the result agrees with the filtering-dependent analysis that no candidate of the novel stop-codon-encoded amino acid is detectable in the current dataset.
As the number of completely sequenced genomes increases, several research groups started to predict proteins that contain stop-codon-encoded amino acids through computational analyses. Most of them are aimed at identification of selenoproteins, reflecting concerns from the scientific community and accumulated knowledge on selenocysteine. In order to improve prediction specificity, they have fully exploited the known features of selenocysteine, such as the SECIS elements or cysteine homologs, which have cysteine in place of selenocysteine. However, since the target of this study is the 23rd amino acid, and there is no a priori knowledge, only general properties of stop-codon-encoded amino acids can be used for prediction. Such general-purpose algorithms have also been developed to date. The method of Chaudhuri and Yeates  extracts iORFs from microbial genomes and analyzes sequence conservation around the inframe stop codon. Their method is thus similar to ours and applicable to both selenocysteine and pyrrolysine. Perrodou et al.  constructed a database of predicted recoding events in microbes. Their method is applicable not only to stop codon readthrough but also to frameshift.
However, both of them did not apply their methods to search for novel amino acids. Therefore, the question of the 23rd amino acid has not been investigated from the viewpoint of coding sequences. Additionally, the previous methods cannot effectively discriminate pseudogenes from readthrough genes. For instance, Chaudhuri and Yeates reported a homolog of cobalamin biosynthesis protein CobN as a novel candidate of pyrrolysine protein. However, the gene is probably a pseudogene because it contains an inframe TAA codon in addition to the TAG codon, and only one species seems to have the amber-containing form of the gene.
The previous methods also assume that proteins with stop-codon-encoded amino acids will have non-readthrough homologs (i.e., homologous proteins that do not have inframe stop codons). However, that is not necessarily true. For example, pyrrolysine-containing monomethylamine methyltransferases adopt TIM barrel fold , but their primary sequences do not exhibit detectable similarity to other TIM barrel proteins because of evolutionary divergence. Dimethylamine methyltransferases also lack non-readthrough homologs. Yet another example is glycine reductase selenoprotein A. Only the selenocysteine-containing form of the enzyme is currently known . Therefore, it is important not to assume non-readthrough homologs for exploring novel candidates. If any non-readthrough homologs are registered in public sequence databases, a careful annotation process of a newly sequenced genome will be able to detect readthrough genes, even though they may be annotated as pseudogenes. However, if all members of a gene family have stop codon readthrough, correct annotation of their gene structure will be extremely difficult, and all of them will be split into two distinct genes.
The method reported here is unique in that it does not assume non-readthrough homologs. Using this method, a systematic screening of the 23rd amino acid and other readthrough genes was carried out. Many of the currently known selenoproteins and pyrrolysine proteins were recovered, indicating the effectiveness of this approach. In particular, successful detection of pyrrolysine-containing methyltransferases and selenoprotein A should be noted. However, almost no novel candidates for readthrough genes were predicted. What can be concluded from this result? The most likely explanation is that the 23rd amino acid does not exist, or its distribution on the tree of life is rather limited. Although a broad spectrum of taxonomic groups has been subjected to genome sequencing, the genomes of most microbial species on the earth have yet to be determined. The unknown amino acid may be used by these species. Alternatively, only one organism in our dataset may have the 23rd amino acid. This is because our method is limited to readthrough genes conserved across two or more species. If the novel amino acid appears in younger, non-conserved sequences, our technique will miss them. In either case, the distribution of the 23rd amino acid will be significantly narrower than that of selenocysteine, which has scattered but wide distribution . This conclusion coincides with and strengthens that of the previous research on tRNA .
Yet another possibility is that the 23rd amino acid exists but is not encoded by stop codons. It is well known that the genetic code varies in several organisms . Thus, certain organisms may use one of the sense codons for the novel amino acid. Because codons for most amino acids are degenerate, redefinition of one of them is feasible. However, that possibility is beyond the scope of this study and is left as an open problem. Bioinformatics analysis of unusual tRNA genes and codon usage may provide insights into this problem.
In addition to the 23rd amino acid, our method can simultaneously explore selenoproteins and other readthrough proteins. A common assumption in microbial selenoprotein predictions is that selenoproteins will have cysteine homologs. Zhang et al.  examined the validity of this assumption using a SECIS-based method and concluded that selenoproteins without cysteine homologs will be extremely rare. Our method can reassess this assumption in a SECIS-independent way. Such selenoproteins identified through our screening of nearly 200 microbial genomes were selenoprotein A and only one uncertain candidate. Therefore, selenoproteins that lack cysteine homologs will be scarce, as previously reported.
Other readthrough proteins with canonical amino acids (i.e., proteins that have canonical amino acids at their inframe stop codons) are quite rare in prokaryotes . The result reported here is in agreement, but it is not conclusive. This is because our method assumes that stop-codon-encoded amino acid is located inside a domain, but it is unclear whether it holds true in prokaryotic readthrough with canonical amino acids. At least, only one experimentally-confirmed example from a pathogenic strain of Escherichia coli , whose genome is not yet determined, does not obey this rule. What can be concluded from our result is that this type of readthrough will be located outside of domains, such as a linker between two domains. Such a stop codon may behave as a switch that regulates production of short and long isoforms from a single mRNA, as in readthrough genes from viruses .
To explore the possibility of a 23rd amino acid, ORFs in prokaryotic genomes were investigated in a comprehensive way. Although many of the currently known selenoproteins and pyrrolysine proteins were successfully detected, no candidate for the 23rd amino acid was discovered. Therefore, if such an amino acid exists, it will have limited distribution in the tree of life. Alternatively, it may be encoded by one of the sense codons. From the viewpoint of selenoprotein prediction, the sensitivity of our method was lower than an existing method. However, our method has several unique features. It is applicable to general readthrough genes and rigorously excludes pseudogenes and sequencing errors. Moreover, it does not assume the occurrence of non-readthrough homologs in the public databases. It will help in identification of novel readthrough genes from the rapidly expanding collection of complete microbial genomes.
A total of 328 complete genome sequences of prokaryotes were downloaded from the KEGG FTP site  in April 2006. From them, 191 representative organisms were selected by excluding close relatives. The threshold was set to average sequence identity 90% of two house-keeping genes, DNA polymerase III α subunit and alanyl-tRNA synthetase. From these 191 genomes, iORFs longer than 80 codons were enumerated using inhouse software, which is available from the author's web site . Both upstream and downstream regions of its inframe stop codon were required to be longer than 10 codons. Two stop codons of an iORF (i.e. the inframe and C-terminal stops) can be any combination of canonical stop codons (TAA, TAG, TGA). However, for Mycoplasma, only TAA and TAG were used. Three codons ATG, TTG and GTG were allowed to be start signals.
The iORFs of each organism were compared with protein-coding genes of the organism using BLASTX. If an iORF matched any protein-coding genes (E-value < 10-3) and their reading frames did not coincide, the iORF was discarded. Similarly, iORFs were compared with RNA genes using BLASTN, and those matched with the RNAs were removed. Remaining iORFs were translated into amino acid sequences. We translated all three types of nonsense codons into the one-letter code U, so as to simplify visual inspection of sequence alignments. Although the code U is usually for selenocysteine, it will be harmless because U is automatically converted into × inside the BLAST programs.
To examine evolutionary conservation of the iORFs, a window of 101 residues around the inframe stop codon was extracted and subjected to TBLASTN searches against the above 191 genome sequences. If there were any hits (E-value < 0.01) in other organisms, and if the hit includes 10 upstream and 10 downstream residues of the inframe stop codon, then the iORF was retained. However, if there were any hits (E-value < 10-5) that did not cover the inframe stop codon, the iORF was discarded. Eligible iORFs were then clustered using BLASTCLUST with score density 0.5 and minimum length coverage 0.6. After removing singleton clusters, multiple sequence alignments of the remaining clusters were computed using MAFFT  with the L-INS-i option. Subsequently, conservation of the inframe stop codons in each cluster was examined. If the location or type of stop codons was not identical, the cluster was discarded.
The first filter examines protein-likeliness of the iORFs. This filter is mainly designed to remove conserved non-coding sequences (CNS) immediately downstream of non-readthrough genes. If we measure purifying selection for amino acid sequences by the ratio of nonsynonymous to synonymous substitution rates (dN/dS), a protein with a stop-codon-encoded amino acid will indicate the sign of selection, while CNS will not. The dN/dS was calculated for each of the two parts flanking the inframe stop codon in an iORF using codeml program in the PAML package . Statistical significance was estimated by likelihood ratio test . The observed alignment was fitted to two distinct substitution models, one of which estimates dN/dS from the data, and the other fixes it to 1.0. Let l free and l fix denote log likelihood of these models. Then, 2Δl = 2(l free – l fix) approximately follows the χ2 distribution with one degree of freedom. If dN/dS was less than 1.0, and the statistics 2Δl was larger than a threshold, we regard it as a sign of purifying selection. In this study, the threshold was set to 5.0 (corresponds to P < 0.025) so that the known readthrough proteins score higher than the threshold. For each of the above clusters, an all-against-all comparison of cluster members was performed. If any pair exhibits such signals in both the N- and C-terminal parts, the cluster was retained.
Even if both the upstream and downstream regions of the inframe stop codon code proteins, they may be two adjacent genes instead of a readthrough protein. The second and third filtering processes remove such genes based on BLAST alignment patterns. Although the boundary analysis applied previously has the same goal (Figure 1b), some gene pairs escaped elimination. To enhance sensitivity of the filters, the whole length of an iORF was used as a BLAST query instead of the partial sequence, and the size of the BLAST database was increased from the 191 nonredundant genomes to the 328 complete genomes in GenomeNet and 246 draft genome sequences downloaded from GenBank in May 2006.
The second filter inspects synteny of iORFs. If the N- and C-terminal parts of an iORF have distinct but closely arranged BLAST hits in other genomes, it strongly suggests the iORF is actually two adjacent genes. Translated sequences of iORFs in the pre-filtering clusters were subjected to TBLASTN searches against the genome database. If both the best hits of the N- and C-terminal parts are statistically significant (E-value < 10-5), and distance between them is less than 1 kbp, we call these hits 'syntenic hits'. If any syntenic hits with non-coinciding reading frames were found, the cluster was removed.
The third filter uses co-occurrence of residues around the inframe stop codon as another source of information for screening stop codon readthrough. Suppose a window of 21 residues centered at the inframe stop codon. In prokaryotes, most stop-codon-encoded amino acids are located inside a domain, the unit of evolutionary sequence conservation. Therefore, in an ideal situation the presence or absence of the 21 residues in alignments will be synchronized. In contrast, if the iORF is actually two adjacent genes, then upstream and downstream residues of the stop codon will appear separately in many alignments. We defined a co-occurrence matrix as a 21 × 21 matrix whose (i,j)-th element represents how often residue i and j appeared simultaneously in N alignments. The matrix elements were subsequently normalized to the number of alignments N. By definition, the more often the upstream and downstream residues of the inframe stop codon co-occur in the alignments, the higher the density in the upper right quarter of the matrix. If average density in the quarter was lower than 0.85, the cluster was filtered out.
For each organism, its iORFs were extracted from the pre-filtering clusters, and codon usage at the inframe stop positions was counted. Codon usage at the C-terminal stop codons in its proteome was also computed using data of coding sequences downloaded from KEGG GENES . These data were combined into a 3 × 2 matrix, and Fisher's exact test was applied. The p-value was corrected for multiple testing using the Bonferroni correction because there were 127 organisms in the pre-filtering clusters.
We are grateful to Nelson Hayes for helpful comments on the paper. This work was supported by grants from the Ministry of Education, Culture, Sports, Science and Technology of Japan, the Japan Science and Technology Agency and ICR Grants for Young Scientists. Computational resources were provided by the Bioinformatics Center, Institute for Chemical Research, Kyoto University.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.