Hon-yaku: a biology-driven Bayesian methodology for identifying translation initiation sites in prokaryotes
© Makita et al; licensee BioMed Central Ltd. 2007
Received: 31 May 2006
Accepted: 08 February 2007
Published: 08 February 2007
Computational prediction methods are currently used to identify genes in prokaryote genomes. However, identification of the correct translation initiation sites remains a difficult task. Accurate translation initiation sites (TISs) are important not only for the annotation of unknown proteins but also for the prediction of operons, promoters, and small non-coding RNA genes, as this typically makes use of the intergenic distance. A further problem is that most existing methods are optimized for Escherichia coli data sets; applying these methods to newly sequenced bacterial genomes may not result in an equivalent level of accuracy.
Based on a biological representation of the translation process, we applied Bayesian statistics to create a score function for predicting translation initiation sites. In contrast to existing programs, our combination of methods uses supervised learning to optimally use the set of known translation initiation sites. We combined the Ribosome Binding Site (RBS) sequence, the distance between the translation initiation site and the RBS sequence, the base composition of the start codon, the nucleotide composition (A-rich sequences) following start codons, and the expected distribution of the protein length in a Bayesian scoring function. To further increase the prediction accuracy, we also took into account the operon orientation. The outcome of the procedure achieved a prediction accuracy of 93.2% in 858 E. coli genes from the EcoGene data set and 92.7% accuracy in a data set of 1243 Bacillus subtilis 'non-y' genes. We confirmed the performance in the GC-rich Gamma-Proteobacteria Herminiimonas arsenicoxydans, Pseudomonas aeruginosa, and Burkholderia pseudomallei K96243.
Hon-yaku, being based on a careful choice of elements important in translation, improved the prediction accuracy in B. subtilis data sets and other bacteria except for E. coli. We believe that most remaining mispredictions are due to atypical ribosomal binding sequences used in specific translation control processes, or likely errors in the training data sets.
Genome sequencing provides investigators with a plain genome text, with no biological indication of the genes' location. The first task associated with genome annotation is therefore gene identification. In recent years, gene prediction methods have been developed as part of many genome projects. Based on criteria strictly defined by previously known genes, the best computational gene identification methods for prokaryote genomes show sensitivities of 98–99% or higher for proper identification of the genes' reading frames. However, based on the widespread assumption that Open Reading Frames (ORFs) and Coding DNA sequences (CDSs) label the same objects, this level of prediction accuracy is calculated using the 3' end location of each gene, not the actual gene span. One of the most widely used methods, Glimmer, tends to predict the CDS to be the longest possible ORF displaying a particular nucleotide pattern based on Markov chain analysis and starting with the first possible translation initiation codon (ATG, TTG or GTG). The conceptual basis of Glimmer rests on the original periodical Markov Chain Analysis approach, GeneMark, which, for precise prediction of the gene's 5' end, also considers sequence features located upstream of the translation initiation sites. The resulting accuracy is 5–30% lower than that of the 3' end predictions. GeneMark often succeeds better in correct gene identification because it is based on discrimination between typical and atypical protein coding states, the latter being assumed to be populated with genes horizontally transferred into a given microbial genome. This was illustrated, for example, with identification of the cyaY gene in Escherichia coli and the secE gene in Helicobacter pylori.
A more accurate translation initiation site (TIS) prediction is important not only for the annotation of unknown CDSs but also for operon prediction and promoter prediction. Furthermore, in silico prediction of genes coding for small untranslated RNAs also depends on the correct identification of intergenic (inter-CDS) distances.
Most existing tools use an unsupervised learning method, using E. coli data sets for validation, due to the lack of experimentally validated data sets in other organisms. In the present work, we adopted a supervised machine learning method for the following reasons. First, we took into account that in the current annotation situation, human annotation is still more reliable than any computational genome-wide prediction, suggesting that by trying to mimic the human approach we might construct more reliable data sets. Second, supervised learning assumes that we implement some knowledge of what we can consider as the most important elements in the prediction method. Furthermore, it is difficult to know the range of correct applicability of unsupervised algorithms without deep knowledge of the algorithms. For example, in a recent comparison between the TiCo algorithm and MED-Start, the latter showed surprisingly low accuracies (around 5%) with high GC-content genomes, although it showed over 90% accuracy in the E. coli data set. This is in line with the general difficulty of identifying translation start sites in GC-rich organisms, where the lack of A or T nucleotides results in long ORFs for purely statistical reasons. To construct an in silico model of translation initiation based on biological knowledge, we take into account the following elements.
First of all, the Ribosome Binding Site (RBS, also named the Shine-Dalgarno sequence, after the authors who proposed that mRNA had to interact with the 16S RNA to permit initiation of translation) is one of the most important elements for translation initiation. The RBS sequence is recognized by a sequence near the 3' end of 16S rRNA in the 30S ribosomal subunit. After the 30S ribosomal subunit binds to mRNA by base pairing to the RBS sequence, the fMet-tRNA identifies the initiation codon and binds to the complex. Next, the 50S ribosomal subunit binds to the complex and begins to elongate the nascent polypeptide.
Compared to Bacillus subtilis, Escherichia coli has relatively short or poorly conserved RBS sequences. To be able to separate these weak RBS sequences from the noise, E. coli has an S1 protein that plays an important role in the correct presentation of most mRNAs to the ribosome. The recognition signal of the S1 protein for binding mRNA has been studied in its molecular details but is not yet completely understood. The S1 protein binds to the leader sequence of mRNAs, upstream of the RBS sequence. On synthetic RNAs, S1 has no strict sequence specificity and binds polyU, polyC, and polyA, as well as various heterogeneous RNAs. However, it has been shown to present sequences possessing the GAGG sequence to the RegB nuclease of bacteriophage T4, indicating that it does have a role in the recognition of the core sequence of the RBS. In contrast, B. subtilis or A+T-rich Firmicutes do not possess an S1 protein (B. subtilis has a counterpart, YpfD, but this protein is not involved in translation). Finally, both E. coli and B. subtilis are weakly AU-rich upstream of the RBS sequence. A difficulty encountered with GC-rich organisms is that long stretches of Gs can easily be mistaken for authentic RBSs. For an accurate prediction of the TIS, we also need to consider translational reinitiation when several cistrons belong to a common transcript. Translational reinitiation frequently occurs if the initiation codon is an AUG, an RBS sequence is present, and the termination codon of the preceding CDS lies between the RBS sequence and the AUG or overlaps the RBS. In this case, the 70S ribosome does not need to be dissociated into 50S and 30S ribosomal subunits to allow translation initiation. Therefore, translational reinitiation signals may be different from those of canonical initiation.
Table 1. Frequency of translation initiation site codons. (Table body not preserved.)
An A-rich sequence following the start codon is typically found in both B. subtilis and E. coli. Those A-rich (A/U-rich) sequences probably stimulate translation initiation by excluding secondary RNA structures.
Furthermore, we also took into account the fact that biases introduced by translation may affect the translation process. To this end, we discriminated between two types of intergenic distance distributions, head-to-head (<- ->) and tail-to-head (-> ->), which we used as proxies for non-operon and operon structures, respectively.
For each of these biological considerations, we assessed to what degree they can contribute to the TIS prediction accuracy, as described in the Results. Based on this evaluation, we selected six elements (see Methods) and combined them into a single score function using Bayesian statistics.
This Bayesian supervised learning method for TIS prediction, which we named Hon-yaku ("translation" in Japanese), showed a prediction accuracy of over 90% for both E. coli and B. subtilis. We also applied this method to GC-rich Gamma-Proteobacteria that do not have any experimentally validated TIS data sets. Our Python scripts can be downloaded. After construction of a reference data set based on core genome sequences, the scripts can be used with some basic knowledge of Python to predict TISs in newly sequenced bacterial genomes. To obtain training data sets, we chose genes that have strong sequence similarity to E. coli or B. subtilis data sets, retaining the genes that display genome persistence. Our algorithm also performed well in P. aeruginosa, B. pseudomallei, and the newly sequenced genome of the Beta-proteobacterium Herminiimonas arsenicoxydans, which can metabolize arsenic.
Results and discussion
RBS sequence motif comparison
Except for some special cases such as leaderless genes, most genes have an RBS sequence around 3–8 bp upstream from the TIS. We considered several RBS motif categories that represent the gene essentiality, the position of each operon, and the organism specificity.
Table 2. Comparison of information content (IC) scores in various data sets. Columns: data set (source), # of genes, IC score. Sources listed: Rudd K.E.; Fang G. et al. (two data sets); Yada T. et al. (Table values not preserved.)
Currently, essential genes are defined by in vivo experiments in several species [17–19]. To investigate a possible contribution of gene essentiality to RBS sequence conservation, we calculated the IC for essential genes and persistent genes, which are strongly conserved in most bacterial genomes. Interestingly, we could not detect specific RBS sequence features that would relate to gene essentiality or persistence, thus validating the use of persistent genes in the training set (as they would not introduce a bias in TIS identification). The IC scores of these particular sets were not larger than the EcoGene data set score, which is the largest data set. We therefore decided to use the RBS sequences extracted from the EcoGene data set.
By contrast, there are significant differences between organisms: B. subtilis, which does not have an S1 protein, shows the largest score of the three organisms (Table 2). This is consistent with the role of protein S1 in the attachment of the mRNA to the 16S rRNA in E. coli.
Accuracy of the method
Selecting the order of the Markov model
Table 3. Comparison of the accuracy of Nth-order Markov models. Columns include # of genes. (Table values not preserved.)
Assimilation vs discrimination
To calculate the relevant Bayesian probability, we considered two alternative models (see Methods). In the first model, an assimilation model, we assumed that the base frequencies of non-TIS sequences near candidate start codons are the same as in the genome-wide background model (Eq. 8). In the second model, a discrimination model, we learned the base frequencies near a non-TIS from the negative data set (Eq. 9). This might have led to an improvement of the outcome, similar to that obtained using discrimination in CDS identification, as illustrated by the better accuracy of GeneMark in gene identification. However, the overall accuracy reported by each model was exactly the same, although different genes were predicted incorrectly by the two approaches. This comparison shows that the differences between background and non-RBS sequences are relatively small.
In this paper, we used the assimilation model, as it is simpler than but achieves the same accuracy as the discrimination model.
Comparison with the TiCo, MED-Start, GS-Finder, and RBSfinder TIS prediction programs (table; most values not preserved). Columns include organism (data set) and # of genes. Rows: E. coli (EcoGene), with a surviving footnoted value of 81.9% (b); E. coli (Link), 80.0% (b); B. subtilis (non-y), 78.5% (b) and 82.8% (b).
Comparison with validation methods (table; values not preserved). Columns: organism (data set), # of genes, 10% cross validation, 20% cross validation. Rows: E. coli (EcoGene), B. subtilis (non-y).
In Hon-yaku, the average distance between the true TIS and the predicted site is 26.2 codons for the 58 false predictions in E. coli.
Estimation of the minimum required size of the training data set
Genes without a canonical RBS motif
Split RBS motif, which would involve the S1 protein translation mechanism.
RBS-less translation supported by the S1 protein
Known unconventional mRNA binding to 16S RNA. This has been demonstrated in the case of translation initiation factor IF3.
The TIS of infC, the structural gene for translational initiation factor IF3, starts with the unusual AUU codon both in E. coli and B. subtilis, which are separated by 1.5 billion years of evolution.
The latest version of Colibri contains four genes starting with ATT. We tried to predict these four genes by including a non-zero probability for an ATT start codon (see Methods). Only infC had a strong enough SD sequence to allow correct prediction against the small probability of an ATT start codon. Colibri has 37 genes with an atypical start codon, of which there are 28 kinds (other than NTG or ATT). Most of these genes code for a defective protein or are functionally unknown.
Presently Hon-yaku evaluates all ATG, GTG, and TTG codons in an ORF as candidate TISs. Hon-yaku can easily be extended to include other possible start codons. However, due to the low prior probability for atypical start codons, they can only be detected if preceded by a sufficiently strong SD sequence. Finally, several cases of spurious CDSs are created by the presence of codons for the 21st and 22nd amino acids, selenocysteine and pyrrolysine, coded by TGA and TAG codons, respectively.
The definition of a gene is notoriously difficult. In particular, it may happen that two different functional gene products are coded from the same DNA sequence, differing only in their start site. This is the case for the B. subtilis lysC gene, which codes for two proteins depending on two in-frame start sites, resulting in a heterotetrameric alpha2/beta2 protein.
In the same way, both in E. coli and in B. subtilis, the gene infB codes for the two forms of the translational initiation factor IF2: IF2 alpha and IF2 beta. The lacZ::fused gene expresses two different products corresponding to the fused proteins IF2 alpha-beta-galactosidase and IF2 beta-beta-galactosidase, which confirms in vivo that the IF2 forms differ at their N terminus.
Examples of candidate multi TISs predictions with a high Bayesian score
Among incorrectly predicted genes, the Bayesian probability of an incorrect site was largest for the fucK gene. A BlastP search for counterparts in other genomes, however, suggested that the predicted start site is actually correct. Indeed, this putatively "false" TIS is annotated as the TIS in Salmonella enterica serovar Typhimurium LT2, Yersinia bercovieri, Yersinia frederiksenii, Sodalis glossinidius, and Shigella boydii. We therefore presume that the Hon-yaku prediction is correct, and that the re-annotated fucK sequence is probably, for some reason, erroneous. Similar situations were uncovered in other genes, suggesting that the identified N-terminus of the corresponding proteins might not correspond to the primary translation product, but to some maturation product. Alternatively, those cases could suggest that some coding regions can code for polypeptides of different length, although a Pfam search did not reveal a salient functional difference between them. Finally, genes may keep multiple TIS candidates to gain robustness against mutations in the vicinity of the TIS.
In an attempt to improve translation initiation site prediction and to make it applicable to a variety of bacterial genomes, we introduced biological knowledge of the translation process in the Hon-yaku algorithm. We considered the RBS sequence, the distance between the TIS and the RBS sequence, the nature of the start codon, the A-rich sequences following start codons, and the distribution of the protein length ratio to compute a joint Bayesian score function. Additionally, using the operon structure predicted from the intergenic distances increases the accuracy by around 2%. Hon-yaku displays all these scores together with the total Bayesian probability for every TIS candidate as a means to improve the objectivity of human annotation.
In addition to user-friendliness, the reason why most existing programs adopt an unsupervised approach is the absence of experimentally validated TIS data. Although a supervised learning method requires more effort for the creation of a training data set, it identifies organism-specific features and allows the user to produce a final description of the best features relevant to a specific organism.
Hon-yaku uses a training set derived from models where TISs have been experimentally established (E. coli and B. subtilis), so strictly speaking, the extrapolation of our successful identifications is limited to Gamma-Proteobacteria and Firmicutes. Further work with other distant clades will be needed to see whether it can be generalised to the whole Bacteria kingdom.
Motif information content
Information content of motif X is
where i is the position, L is the length of the motif, and n ranges over the nucleotides A, C, G, and T. For the information content calculation based on N data set sequences, we added pseudocounts, using the background probability of each base. We used the 30 bp upstream and the 20 bp downstream of each TIS for the calculation.
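The display equation for the information content did not survive in this copy of the text. As a hedged illustration only, the sketch below assumes the common relative-entropy form, summing f(i,n) * log2(f(i,n) / p(n)) over positions i and nucleotides n, with pseudocounts weighted by the background frequencies as described above. The function name, the `pseudocount` parameter, and the exact form of the sum are assumptions, not the authors' code.

```python
import math

BASES = "ACGT"

def information_content(seqs, background, pseudocount=1.0):
    """Relative-entropy information content (in bits) of aligned motif sequences.

    seqs        -- equal-length strings over A, C, G, T (the aligned motif)
    background  -- dict base -> background probability (must be > 0)
    pseudocount -- total pseudocount mass per column, split by background
                   frequency (an assumed scheme)
    """
    length = len(seqs[0])
    n_seqs = len(seqs)
    total = 0.0
    for i in range(length):
        counts = {n: 0.0 for n in BASES}
        for s in seqs:
            counts[s[i]] += 1.0
        for n in BASES:
            f = (counts[n] + pseudocount * background[n]) / (n_seqs + pseudocount)
            if f > 0.0:  # guard for pseudocount == 0
                total += f * math.log2(f / background[n])
    return total
```

With a uniform background, a fully conserved column contributes close to 2 bits and a column matching the background contributes 0 bits.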
Experimentally validated data set for translation initiation sites
We used the EcoGene database and the Link data set as reliable data sets of translation initiation sites in E. coli. The EcoGene database contains 862 proteins that were confirmed by N-terminal protein sequence identification. We removed from the data set a selenoprotein, release factor 2 (which is known to be synthesized by a +1 frameshift), and two genes starting with ATT instead of a canonical start codon (ATG, GTG, or TTG), leaving 858 genes.
The Link data set contains 195 genes; four of these are not consistent with the EcoGene data set. To construct a fully reliable data set, we removed these four genes (hdeB, leuB, lolA, and ydcG). For B. subtilis, we used a data set of 1248 'non-y' (i.e., experimentally characterized) genes and checked them using the new GenBank annotation (NC_000964.2). Two genes had been removed in the new GenBank annotation, and three codons previously identified as start codons were changed to ATC, ATT, and CTG. We removed those data, leaving 1243 genes in the data set. We also included the more reliable 58 sequences confirmed by comparison with homologous sequences of Bacillus halodurans.
Constructing data set with sequence homology
When we apply Hon-yaku to a newly sequenced bacterial genome such as H. arsenicoxydans, we need to construct a reliable data set with strong sequence homology to experimentally validated genes. Using the two currently available data sets, the EcoGene data set and the B. subtilis non-y data set, we defined presumably correct start sites for genomes where experimental data on actual start sites are missing. We used the set of related persistent genes (this works for Proteobacteria and Firmicutes), aligned them individually with their counterparts in the model organisms (E. coli and B. subtilis), and chose the start site manually.
Select orthologous genes from the EcoGene data set or the B. subtilis non-y data set.
Remove genes that are not aligned in the TIS vicinity or that have two or more candidate TISs within 5 bp. With the 165 orthologous genes, we confirmed that 89% of the TIS position differences are less than 5 bp. We kept only genes whose TIS is located within 5 bp upstream or downstream of the experimentally validated TIS and that have no other candidate TIS within this 5 bp vicinity. From these rules, we obtained a data set of 126 genes with 100% accuracy out of the 165 orthologous genes.
We applied this procedure to P. aeruginosa, B. pseudomallei, and H. arsenicoxydans to construct the training data sets.
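The filtering rule described above can be sketched as follows. This is a minimal illustration, assuming that TIS positions of the query gene and of its validated ortholog have already been mapped to common alignment coordinates. The function name and the argument layout are hypothetical and are not taken from the Hon-yaku scripts.

```python
def keep_for_training(aligned_tis, validated_tis, candidate_positions, window=5):
    """Decide whether an orthologous gene is retained in the training set.

    aligned_tis         -- position of the chosen start site (alignment coordinates)
    validated_tis       -- position of the experimentally validated TIS of the ortholog
    candidate_positions -- positions of all candidate start codons of the query gene
    window              -- tolerated offset in bp (5 bp in the text)
    """
    # Rule 1: the chosen TIS must lie within `window` bp of the validated TIS.
    if abs(aligned_tis - validated_tis) > window:
        return False
    # Rule 2: no other candidate TIS may lie within the same vicinity.
    rivals = [p for p in candidate_positions
              if p != aligned_tis and abs(p - validated_tis) <= window]
    return not rivals
```

Genes rejected by either rule are left out of the training set rather than corrected.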
Modeling to predict translation initiation sites
The motif sequence around the ribosomal binding site (RBS), identifying the RBS region using a weight matrix constructed from the reference data set
The empirically determined distance between the RBS sequence and the start codon
The base composition of the start codon
The base composition of the beginning of the protein coding sequence with a position specific scoring matrix
The empirically determined length of the protein
Additionally, we took into account overlapping ORFs using the empirically determined intergenic distance distributions. This methodology requires only the positions of stop codons and evaluates all TIS candidates that are located between the stop codon and the nearest upstream stop codon. We obtained the annotation by running GeneMark on the genome of H. arsenicoxydans and by using GenBank entries for the other organisms.
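The candidate enumeration described above can be sketched as a backward scan from the annotated stop codon. This is an illustrative sketch only: it assumes a forward-strand sequence, 0-based coordinates, the standard stop codons TAA, TAG and TGA, and that "nearest upstream stop codon" means the nearest one in the same reading frame. The function name is hypothetical.

```python
START_CODONS = ("ATG", "GTG", "TTG")
STOP_CODONS = ("TAA", "TAG", "TGA")

def candidate_tis(seq, stop_pos):
    """Return 0-based positions of all in-frame candidate start codons.

    seq      -- forward-strand DNA sequence (upper case)
    stop_pos -- 0-based index of the first base of the gene's stop codon
    """
    candidates = []
    pos = stop_pos - 3
    while pos >= 0:
        codon = seq[pos:pos + 3]
        if codon in STOP_CODONS:
            break  # nearest upstream in-frame stop: the ORF ends here
        if codon in START_CODONS:
            candidates.append(pos)
        pos -= 3
    return sorted(candidates)
```

For genes on the reverse strand, the same scan would be applied to the reverse complement.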
Motif search around the RBS
Different tools adopted different methods to model the RBS. Hannenhalli et al. used the RBS binding energy to find the RBS motif. The program RBSfinder considers the number of hydrogen bonds to detect motifs complementary to the 3' end of the 16S rRNA. GS-Finder uses the "Z-curve" method, which considers differences of the cumulative occurrence numbers for three kinds of base combinations, and considers the A, C, G, T contexts in a window. Recently, because of the remarkable progress in motif extraction tools and to avoid having to calculate the binding energy between an organism-dependent 16S rRNA and the mRNA, position-specific weight matrices (0th-order Markov models) have been applied to describe the RBS sequence motif (e.g., MED-Start). In this paper, we also used a 0th-order Markov model and, in addition, explored higher-order Markov models. To describe the motif sequences by a 1st-order Markov model, we denote the transition probability of the dinucleotide "mn" as $a_{mn} = P(x_i = n \mid x_{i-1} = m)$. The probability that the motif sequence $S_M$ is generated by this model is then:
where i is the position and L is the length of the motif.
The log-likelihood ratio that the sequence $S_M$ is created by the model is
where the weight matrix of the 1st-order Markov chain gives the score for a nucleotide n at position i to be followed by the nucleotide m. We prepared one log-odds scoring matrix $M_{SD}$ to describe the conserved region around the ribosomal binding site, and another matrix $M_{DS}$ to describe the downstream adenine-rich region following the start codon. Those motifs are defined by multiple alignments. In this section, we described the 1st-order Markov model. When comparing the 0th-, 1st-, and 2nd-order Markov models in E. coli, B. subtilis, and Herminiimonas arsenicoxydans, we found that a 1st-order Markov model yields more accurate results in both E. coli and B. subtilis, whereas a 0th-order model was most accurate for Herminiimonas arsenicoxydans (Table 3).
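Because the two display equations are missing from this copy, the following sketch shows one plausible reading of a position-specific 1st-order Markov log-odds score, not the authors' exact formulas. It assumes the motif probability is the probability of the first base times the product of position-specific transition probabilities, and that the background is a 0th-order model with independent base frequencies. The pseudocount scheme and all function names are assumptions.

```python
import math

BASES = "ACGT"

def train_first_order(seqs, pseudocount=1.0):
    """Estimate position-specific initial and transition probabilities
    from aligned, equal-length motif sequences."""
    length = len(seqs[0])
    init = {n: pseudocount for n in BASES}
    for s in seqs:
        init[s[0]] += 1.0
    total = sum(init.values())
    init = {n: c / total for n, c in init.items()}
    trans = []  # trans[i - 1][m][n] = P(x_i = n | x_{i-1} = m)
    for i in range(1, length):
        counts = {m: {n: pseudocount for n in BASES} for m in BASES}
        for s in seqs:
            counts[s[i - 1]][s[i]] += 1.0
        trans.append({m: {n: counts[m][n] / sum(counts[m].values())
                          for n in BASES} for m in BASES})
    return init, trans

def log_odds(seq, init, trans, background):
    """Natural-log likelihood ratio of the motif model against a
    0th-order background model (assumed form)."""
    score = math.log(init[seq[0]] / background[seq[0]])
    for i in range(1, len(seq)):
        p = trans[i - 1][seq[i - 1]][seq[i]]
        score += math.log(p / background[seq[i]])
    return score
```

A 0th-order weight matrix is the special case in which each transition probability depends only on the position and the current base.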
The empirically determined distance from an RBS sequence to a start codon
Base composition of start codons
Table 1 shows the frequency of each start codon for the three bacteria. We also calculated the frequency of ATG, GTG, and TTG codons upstream and downstream of the true TIS to create a negative TIS data set (Eq. 9).
Distribution of protein length ratio
Combining features around TIS
The Bayesian posterior probability that a gene starts from the translation initiation site TIS can be calculated as
where the prior probability $P_{prior}(TIS)$ is calculated as the frequency of the start codon. $P(S, D_{protein} \mid TIS)$ is the conditional probability that the sequence $S$ is generated around a true translation initiation site, resulting in a protein coding region of length $D_{protein}$. The sequence $S$ around the TIS consists of the ribosomal binding site $S_{SD}$, the start codon $S_{STC}$, the sequence $S_{DS}$ downstream of the TIS, and the remaining sequence $S \setminus S_{SD} S_{STC} S_{DS}$. We can then decompose $P(S, D_{protein} \mid TIS)$ into six parts:
$$P(S, D_{protein} \mid TIS) = P(S_{SD} \mid TIS) \cdot f_{dist}(D_{SD2STC}) \cdot P(S_{STC} \mid TIS) \cdot P(S_{DS} \mid TIS) \cdot f_{dist}(D_{protein}) \cdot P(S \setminus S_{SD} S_{STC} S_{DS} \mid background) \quad (5)$$
Here $f_{dist}(D_{SD2STC})$ is the probability that the RBS sequence $S_{SD}$ is found at a distance $D_{SD2STC}$ from the start codon, and $f_{dist}(D_{protein})$ is the distribution of the protein length.
Dividing by the background probability yields
where $M_{SD}$ and $M_{DS}$ are the values of the PSSM score for the RBS sequence and for the downstream region around the translation initiation site, and $P(S_{STC} \mid TIS)$ is the base composition of the start codon, as determined from the known E. coli data set.
We define the score function
$$score(TIS) \equiv \ln P_{prior}(TIS) + M_{SD} + \ln f_{dist}(D_{SD2STC}) + \ln P(S_{STC} \mid TIS) + M_{DS} + \ln f_{dist}(D_{protein}). \quad (7)$$
For the calculation of $P(TIS \mid S, D_{protein})$, we can consider either an assimilation method (Eq. 8) or a discrimination method (Eq. 9). The assimilation method makes the assumption that the base frequency around an ATG, GTG, or TTG codon that is not a start codon is the same as in the whole-genome background model.
where nonTIS represents an ATG, GTG, or TTG codon that does not function as a start codon.
In the discrimination method, we need to make negative data sets which explicitly model nonTIS features. In this case, we made two models, which represent the upstream (intergenic) region, $nonTIS_{up}$, and the downstream (in coding region) region, $nonTIS_{down}$, to distinguish between protein coding features and non-coding features.
In Hon-yaku, we calculate score (TIS) and the Bayesian posterior probability that a gene starts from the TIS for all translation initiation sites in the ORF.
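As a minimal sketch of how the score function of Eq. 7 can be evaluated for every candidate in an ORF, the code below sums the six log terms and returns the best-scoring candidate. The feature values are assumed to have been computed beforehand (log-odds scores for the two motifs, probabilities for the other terms). The function and argument names are hypothetical. The posterior probabilities of Eqs. 8 and 9 are not reproduced here, because those equations are missing from this copy of the text.

```python
import math

def tis_score(prior, m_sd, f_sd2stc, p_stc, m_ds, f_protein):
    """Score of one TIS candidate following the six terms of Eq. 7.

    prior     -- prior probability of the candidate start codon
    m_sd      -- log-odds score of the RBS motif
    f_sd2stc  -- empirical probability of the RBS-to-start-codon distance
    p_stc     -- start codon probability
    m_ds      -- log-odds score of the downstream region
    f_protein -- empirical probability of the resulting protein length
    """
    return (math.log(prior) + m_sd + math.log(f_sd2stc)
            + math.log(p_stc) + m_ds + math.log(f_protein))

def best_candidate(candidates):
    """candidates maps a position to a dict of the six feature values.
    Returns (position, score) of the highest-scoring candidate."""
    scored = {pos: tis_score(**feat) for pos, feat in candidates.items()}
    best = max(scored, key=scored.get)
    return best, scored[best]
```

Keeping the six terms separate also makes it straightforward to report each contribution alongside the total, as the text describes for supporting human annotation.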
Other contributing elements
To increase the prediction accuracy, we additionally considered the operon structure, and alternative candidate start codons that are either adjacent or separated by one codon.
If the two genes are arranged in a head-to-head configuration and the intergenic distance is under 100 bp, we added an empirically determined intergenic distance distribution $\ln f_{dist}(D_{headtohead})$ to the score function (Eq. 7). If the two genes have the same direction and the intergenic distance is under 50 bp, we added an empirically determined intergenic distance distribution $\ln f_{dist}(D_{tailtohead\_under50bp})$ to the score function. Thus, we aimed to reduce mispredictions leading to genes with long overlapping sequence regions. This function also improves the prediction of genes with the start codon close to the previous stop codon, as often occurs in operons.
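The conditional addition of the intergenic distance term can be sketched as follows. The two thresholds (100 bp and 50 bp) come from the text. The function name, the orientation labels, and the representation of the empirical distributions as callables are assumptions made for illustration.

```python
import math

def add_intergenic_term(score, distance, orientation,
                        f_head_to_head, f_tail_to_head):
    """Add the log of the empirical intergenic distance probability.

    score          -- score of the TIS candidate (Eq. 7)
    distance       -- intergenic distance in bp implied by this candidate
    orientation    -- 'head_to_head' or 'tail_to_head' (hypothetical labels)
    f_head_to_head -- callable: distance -> empirical probability
    f_tail_to_head -- callable: distance -> empirical probability
    """
    if orientation == "head_to_head" and distance < 100:
        return score + math.log(f_head_to_head(distance))
    if orientation == "tail_to_head" and distance < 50:
        return score + math.log(f_tail_to_head(distance))
    return score  # neighbouring gene too far away: score unchanged
```

A negative distance can be used to represent overlapping genes, provided the empirical distributions are defined for such values.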
Another reason for incorrect predictions is that some genes have two start codon candidates close to each other. Especially when two candidates are contiguous, the distance function between the start codon and the RBS sequence, $f_{dist}(D_{SD2STC})$, gives ambiguous results. In this case, our algorithm chooses the TIS based on the distribution of the start codon location for MM and MXM amino acid sequences. We constructed the species-specific distribution in E. coli and B. subtilis and applied the E. coli distribution to other bacteria that have a small number of data set genes.
Except for the case of two neighboring start codons, which had to be fixed as described above, we established the value of all other parameters using the training data set.
In this paper, we calculated accuracies of Hon-yaku with a leave-one-out cross validation analysis. To avoid showing only the overoptimistic performance rates of the leave-one-out measure, we also calculated the performance of our method with other cross validations. We trained our model with 90% or 80% of the true data set, while the randomly chosen remaining 10% or 20% are retained for subsequent use in evaluating our model. The procedure was repeated one thousand times.
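The repeated random hold-out validation described above can be sketched as follows. The function name and the `train_and_test` callback, which is assumed to train on one gene list and return the accuracy on the other, are hypothetical. The seed is included only to make runs reproducible.

```python
import random

def repeated_holdout(genes, train_and_test, holdout_fraction=0.1,
                     repeats=1000, seed=0):
    """Mean accuracy over repeated random train/test splits.

    genes            -- sequence of training examples
    train_and_test   -- callable(train, test) -> accuracy in [0, 1]
    holdout_fraction -- share of genes held out for testing (0.1 or 0.2 in the text)
    repeats          -- number of random splits (1000 in the text)
    """
    rng = random.Random(seed)
    n_test = max(1, round(len(genes) * holdout_fraction))
    accuracies = []
    for _ in range(repeats):
        shuffled = list(genes)
        rng.shuffle(shuffled)
        test, train = shuffled[:n_test], shuffled[n_test:]
        accuracies.append(train_and_test(train, test))
    return sum(accuracies) / len(accuracies)
```

Leave-one-out validation corresponds to holding out a single gene and iterating over every gene once rather than sampling at random.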
We thank Kenta Nakai of the Univ. of Tokyo for his kind advice on this manuscript. YM was supported by a scholarship from the Association des Amis de l'Institut Pasteur in Japan. Gene identification in Bacteria was supported by the European Union Network of Excellence BioSapiens, grant LSHG CT-2003-503265.
- Delcher AL, Harmon D, Kasif S, White O, Salzberg SL: Improved microbial gene identification with GLIMMER. Nucleic Acids Research 1999, 27(23):4636–41. 10.1093/nar/27.23.4636PubMed CentralView ArticlePubMedGoogle Scholar
- Besemer J, Lomsadze A, Borodovsky M: GeneMarkS: a self-training method for prediction of gene starts in microbial genomes. Implications for finding sequence motifs in regulatory regions. Nucleic Acids Research 2001, 29(12):2607–18. 10.1093/nar/29.12.2607PubMed CentralView ArticlePubMedGoogle Scholar
- Trotot P, Sismeiro O, Vivares C, Glaser P, Bresson-Roy A, Danchin A: Comparative analysis of the cya locus in enterobacteria and related gram-negative facultative anaerobes. Biochimie 1996, 78(4):277. 10.1016/0300-9084(96)82192-4View ArticlePubMedGoogle Scholar
- Medigue C, Wong B, Lin M, Bocs S, Danchin A: The secE gene of Helicobacter pylori . J Bacteriol 2002, 184(10):2837. 10.1128/JB.184.10.2837-2840.2002PubMed CentralView ArticlePubMedGoogle Scholar
- Moreno-Hagelsieb G, Collado-Vides J: A powerful non-homology method for the prediction of operons in prokaryotes. Bioinformatics 2002, (Suppl 1):S329–36.Google Scholar
- Carter RJ, Dubchak I, Holbrook SR: A computational approach to identify genes for functional RNAs in genomic sequences. Nucleic Acids Research 2001, 29(19):3928–38.
- Tech M, Meinicke P: An unsupervised classification scheme for improving predictions of prokaryotic TIS. BMC Bioinformatics 2006, 7: 121. doi:10.1186/1471-2105-7-121
- Shine J, Dalgarno L: The 3'-terminal sequence of Escherichia coli 16S ribosomal RNA: complementarity to nonsense triplets and ribosome binding sites. Proc Natl Acad Sci USA 1974, 71(4):1342–6. doi:10.1073/pnas.71.4.1342
- Petersen H, Danchin A, Grunberg-Manago M: Toward an understanding of the formylation of initiator tRNA methionine in prokaryotic protein synthesis. II. A two-state model for the 70S ribosome. Biochemistry 1976, 15(7):1362–9. doi:10.1021/bi00652a002
- Lebars I, Hu RM, Lallemand JY, Uzan M, Bontems F: Role of the substrate conformation and of the S1 protein in the cleavage efficiency of the T4 endoribonuclease RegB. J Biol Chem 2001, 276(16):13264–7. doi:10.1074/jbc.M010680200
- Nitschke P, Guerdoux-Jamet P, Chiapello H, Faroux G, Henaut C, Henaut A, Danchin A: Indigo: a World-Wide-Web review of genomes and gene functions. FEMS Microbiol Rev 1998, 22(4):207–27.
- Kozak M: Regulation of translation via mRNA structure in prokaryotes and eukaryotes. Gene 2005, 361: 13–37. doi:10.1016/j.gene.2005.06.037
- Rocha EP, Viari A, Danchin A: Oligonucleotide bias in Bacillus subtilis: general trends and taxonomic comparisons. Nucleic Acids Research 1998, 26(12):2971–80. doi:10.1093/nar/26.12.2971
- Qing G, Xia B, Inouye M: Enhancement of translation initiation by A/T-rich sequences downstream of the initiation codon in Escherichia coli. J Mol Microbiol Biotechnol 2003, 6(3–4):133–44. doi:10.1159/000077244
- Fang G, Rocha E, Danchin A: How essential are nonessential genes? Mol Biol Evol 2005, 22(11):2147–56. doi:10.1093/molbev/msi211
- Hutchison CA, Peterson SN, Gill SR, Cline RT, White O, Fraser CM, Smith HO, Venter JC: Global transposon mutagenesis and a minimal Mycoplasma genome. Science 1999, 286(5447):2165–9. doi:10.1126/science.286.5447.2165
- Kobayashi K, Ehrlich S, Albertini A, Amati G, Andersen K, Arnaud M, Asai K, Ashikaga S, Aymerich S, Bessieres P, Boland F, Brignell S, Bron S, Bunai K, Chapuis J, Christiansen L, Danchin A, Debarbouille M, Dervyn E, Deuerling E, Devine K, Devine S, Dreesen O, Errington J, Fillinger S, Foster S, Fujita Y, Galizzi A, Gardan R, Eschevins C, Fukushima T, Haga K, Harwood C, Hecker M, Hosoya D, Hullo M, Kakeshita H, Karamata D, Kasahara Y, Kawamura F, Koga K, Koski P, Kuwana R, Imamura D, Ishimaru M, Ishikawa S, Ishio I, Le Coq D, Masson A, Mauel C, Meima R, Mellado R, Moir A, Moriya S, Nagakawa E, Nanamiya H, Nakai S, Nygaard P, Ogura M, Ohanan T, O'Reilly M, O'Rourke M, Pragai Z, Pooley H, Rapoport G, Rawlins J, Rivas L, Rivolta C, Sadaie A, Sadaie Y, Sarvas M, Sato T, Saxild H, Scanlan E, Schumann W, Seegers J, Sekiguchi J, Sekowska A, Seror S, Simon M, Stragier P, Studer R, Takamatsu H, Tanaka T, Takeuchi M, Thomaides H, Vagner V, van Dijl J, Watabe K, Wipat A, Yamamoto H, Yamamoto M, Yamamoto Y, Yamane K, Yata K, Yoshida K, Yoshikawa H, Zuber U, Ogasawara N: Essential Bacillus subtilis genes. Proc Natl Acad Sci USA 2003, 100(8):4678–83. doi:10.1073/pnas.0730515100
- Ji Y, Zhang B, Van Horn SF, Warren P, Woodnutt G, Burnham M, Rosenberg M: Identification of critical staphylococcal genes using conditional phenotypes generated by antisense RNA. Science 2001, 293(5538):2266–9. doi:10.1126/science.1063566
- Escherichia coli and Salmonella: Cellular and Molecular Biology. In Science. Volume 2. Washington, DC: ASM Press; 1996:902–8.
- Link AJ, Robison K, Church GM: Comparing the predicted and observed properties of proteins encoded in the genome of Escherichia coli K-12. Electrophoresis 1997, 18(8):1259–313. doi:10.1002/elps.1150180807
- Zhu HQ, Hu GQ, Ouyang ZQ, Wang J, She ZS: Accuracy improvement for identifying translation initiation sites in microbial genomes. Bioinformatics 2004, 20(18):3308–17. doi:10.1093/bioinformatics/bth390
- Ou HY, Guo FB, Zhang CT: GS-Finder: a program to find bacterial gene start sites with a self-training method. Int J Biochem Cell Biol 2004, 36(3):535–44. doi:10.1016/j.biocel.2003.08.013
- Suzek BE, Ermolaeva MD, Schreiber M, Salzberg SL: A probabilistic method for identifying start codons in bacterial genomes. Bioinformatics 2001, 17(12):1123–30. doi:10.1093/bioinformatics/17.12.1123
- Boni IV, Artamonova VS, Tzareva NV, Dreyfus M: Non-canonical mechanism for translational control in bacteria: synthesis of ribosomal protein S1. EMBO Journal 2001, 20(15):4222–32. doi:10.1093/emboj/20.15.4222
- Skorski P, Leroy P, Fayet O, Dreyfus M, Hermann-Le Denmat S: The Highly Efficient Translation Initiation Region from the Escherichia coli rpsA Gene Lacks a Shine-Dalgarno Element. J Bacteriol 2006, 188(17):6277–85. doi:10.1128/JB.00591-06
- Zuker M: Mfold web server for nucleic acid folding and hybridization prediction. Nucleic Acids Research 2003, 31(13):3406–15. doi:10.1093/nar/gkg595
- Huerta AM, Collado-Vides J: Sigma70 promoters in Escherichia coli: specific transcription in dense regions of overlapping promoter-like signals. J Mol Biol 2003, 333(2):261–78. doi:10.1016/j.jmb.2003.07.017
- Laursen BS, Sorensen HP, Mortensen KK, Sperling-Petersen HU: Initiation of protein synthesis in bacteria. Microbiol Mol Biol Rev 2005, 69: 101–23. doi:10.1128/MMBR.69.1.101-123.2005
- Uzan M: Bacteriophage T4 RegB endoribonuclease. Methods Enzymol 2001, 342: 467–80.
- Brombach M, Pon CL: The unusual translational initiation codon AUU limits the expression of the infC (initiation factor IF3) gene of Escherichia coli. Mol Gen Genet 1987, 208(1–2):94–100. doi:10.1007/BF00330428
- Medigue C, Viari A, Henaut A, Danchin A: Colibri: a functional data base for the Escherichia coli genome. Microbiol Rev 1993, 57(3):623–54.
- Chaudhuri BN, Yeates TO: A computational method to predict genetically encoded rare amino acids in proteins. Genome Biol 2005, 6(9):R79. doi:10.1186/gb-2005-6-9-r79
- Chen N, Paulus H: Mechanism of expression of the overlapping genes of Bacillus subtilis aspartokinase II. J Biol Chem 1988, 263(19):9526–32.
- Plumbridge J, Deville F, Sacerdot C, Petersen H, Cenatiempo Y, Cozzone A, Grunberg-Manago M, Hershey J: Two translational initiation sites in the infB gene are used to express initiation factor IF2 alpha and IF2 beta in Escherichia coli. EMBO J 1985, 4: 223–9.
- Bateman A, Coin L, Durbin R, Finn RD, Hollich V, Griffiths-Jones S, Khanna A, Marshall M, Moxon S, Sonnhammer EL, Studholme DJ, Yeats C, Eddy SR: The Pfam protein families database. Nucleic Acids Res 2004, 32: D138–41. doi:10.1093/nar/gkh121
- Rudd KE: EcoGene: a genome sequence database for Escherichia coli K-12. Nucleic Acids Research 2000, 28: 60–4. doi:10.1093/nar/28.1.60
- Yada T, Totoki Y, Takagi T, Nakai K: A novel bacterial gene-finding system with improved accuracy in locating start codons. DNA Research 2001, 8(3):97–106. doi:10.1093/dnares/8.3.97
- Tatusov RL, Koonin EV, Lipman DJ: A genomic perspective of protein families. Science 1997, 278(5338):631–7. doi:10.1126/science.278.5338.631
- Rocha EP, Danchin A, Viari A: Translation in Bacillus subtilis: roles and trends of initiation and termination, insights from a genome analysis. Nucleic Acids Res 1999, 27(17):3567–76. doi:10.1093/nar/27.17.3567
- Hannenhalli SS, Hayes WS, Hatzigeorgiou AG, Fickett JW: Bacterial start site prediction. Nucleic Acids Res 1999, 27(17):3577–82. doi:10.1093/nar/27.17.3577
- Zhang R, Zhang CT: Z curves, an intuitive tool for visualizing and analyzing the DNA sequences. Journal of Biomolecular Structure and Dynamics 11: 767–82.
- Silverman B: Density Estimation for Statistics and Data Analysis. Chapman and Hall, London; 1986.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.