 Methodology article
 Open Access
Computational evaluation of TIS annotation for prokaryotic genomes
BMC Bioinformatics volume 9, Article number: 160 (2008)
Abstract
Background
Accurate annotation of translation initiation sites (TISs) is essential for understanding the translation initiation mechanism. However, the reliability of TIS annotation in widely used databases such as RefSeq is uncertain due to the lack of experimental benchmarks.
Results
Based on a homogeneity assumption that gene translation-related signals are uniformly distributed across a genome, we have established a computational method for large-scale quantitative assessment of the reliability of TIS annotations for any prokaryotic genome. The method models the positional weight matrix (PWM) of aligned sequences around predicted TISs as a linear combination of three elementary PWMs, one for true TISs and two for false TISs. The three elementary PWMs are obtained from a reference set with highly reliable TIS predictions. A generalized least-squares estimator determines the weight of the true-TIS PWM in the observed PWM, from which the accuracy of the prediction is derived. The validity of the method and the limitations of its assumptions are addressed explicitly by testing on experimentally verified TISs with reference sets of varying accuracy. The method is applied to estimate the accuracy of TIS annotations provided by public databases such as RefSeq and ProTISA and by programs such as EasyGene, GeneMarkS, Glimmer 3 and TiCo. RefSeq's TIS annotation is shown to be significantly less accurate than two recent predictors, TiCo and ProTISA. We present quantitative evidence of two general biases in the RefSeq annotation: over-annotating the longest open reading frame (LORF) and under-annotating the ATG start codon. Finally, we have established a new TIS database, SupTISA, based on the best prediction among all the predictors; SupTISA achieves an average accuracy of 92% over all 532 complete genomes.
Conclusion
Large-scale computational evaluation of TIS annotation has been achieved. A new TIS database, estimated to be substantially more accurate than RefSeq, has been constructed, and it provides a valuable resource for further TIS studies.
Background
To initiate translation in prokaryotes, a ribosome binds to a specific region of the mRNA and then recognizes a nearby start codon. The position of the first nucleotide base pair (bp) of the start codon is called the translation initiation site (TIS). The sequence upstream of the TIS, the start codon itself and the sequence downstream of the TIS show specific patterns that differ from genome to genome. In most prokaryotic genes, the sequence about 20 bps upstream of the TIS contains a purine-rich Shine-Dalgarno sequence [1]. However, increasing numbers of genes without a Shine-Dalgarno sequence, known as leaderless genes if they also lack a 5'-untranslated region, have been reported in archaeal genomes [2]. Genome-wide computational analysis of leaderless genes revealed A/T-rich sequences in a region about 30 bps further upstream [3]. The start codon in most cases shows a strong preference for the ATG triplet over others such as TTG and GTG [4]. Sequences downstream of the TIS exhibit a periodicity of three in codon usage. Comparative genomic studies show that the sequence patterns around the true TIS may differ significantly between genomes. With the aid of a sequence logo tool, Torarinsson et al. [3] and Zhu et al. [5] reported the variation of sequence patterns among dozens of archaeal genomes, which sheds light on the divergence of translation initiation mechanisms in prokaryotes.
Knowledge of the exact TIS is essential for experiments that identify natively purified proteins by N-terminal amino acid sequencing, as well as for heterologous protein production [6]. However, there are increasing concerns about the quality of TIS annotation in widely used databases such as GenBank and RefSeq [5–9]. Earlier microbial genome projects tended to annotate the 5'-most candidate start that is in frame with the stop codon [7]. In addition, Poole et al. [6] observed strong discrepancies in TIS annotation between the CMR and RefSeq databases for several genomes. Despite manual corrections and periodic updates, the quality of current TIS annotations remains largely uncertain, and an independent method for assessing the reliability of TIS annotation is desirable. Such a method, if successful, may also provide hints for further improvement. The need is all the more urgent because databases such as RefSeq are so widely used by experimental biologists that annotation errors can have a large impact.
Several attempts have been made to assess the reliability of TIS annotation. Nielsen and Krogh [8] were the first to make a serious large-scale assessment of the reliability of the TIS annotation in RefSeq, but their approach, which takes EasyGene 1.2 as the "gold standard" for comparison, is questionable. As we will see later, EasyGene's own accuracy is not outstanding, hence the biased assessment is of limited interest. Frishman et al. [10], using the Orpheus program, showed that the information content of aligned TIS upstream sequences correlates with the TIS prediction accuracy. Zhu et al. [5] made a qualitative assessment of the relative TIS annotation quality of two TIS predictors by comparing the sequence logos [11] of aligned TIS upstream sequences. In this assessment, the sequence logo around the aligned TISs of a consensus set predicted by both predictors (called the consensus logo) is considered reliable; its difference from the sequence logo of a 'specific' set predicted by only one program (called the specific logo) then indicates qualitatively the TIS accuracy of that program. Taking S. solfataricus as an example, Zhu et al. [5] showed that the specific logo of MED 2.0 is very similar to the consensus logo obtained jointly with the GenBank annotation, whereas the specific logo of GenBank shows almost no sequence pattern. This result suggests that the accuracy of the GenBank TIS annotation in S. solfataricus is lower than that of MED 2.0. Generally speaking, however, no systematic method exists to computationally evaluate the accuracy of TIS prediction.
We propose here a computational method to quantitatively estimate the TIS annotation accuracy of a prokaryotic genome; the annotation can be provided by either a program or a database. The method is based on a homogeneity assumption that the sequence patterns represented by a PWM around TISs are homogeneous for a generic subset of genes of a genome. The whole set of TIS predictions is split into two sets: set \mathbb{I}, called the reference set, is constructed to be nearly 100% accurate (see section "Reference set"), and set \mathbb{O} contains only partially accurate predictions, which are to be quantitatively evaluated. We assume that the sets \mathbb{I} and \mathbb{O} are generic subsets; this assumption is difficult to prove, but is sound as a first approximation. It is then assumed that the PWM around predicted TISs in the set \mathbb{O} can be modelled as a linear combination of three elementary PWMs, one around true TISs and the other two around false TISs located upstream and downstream of the true TIS, respectively. All three elementary PWMs are obtained from the sequence patterns of the reference set \mathbb{I}, which naturally carries genome-specific features. A generalized least-squares estimator then determines the weight of each of the three PWMs, and the weight of the true-TIS PWM determines the accuracy of the TIS annotation in the set \mathbb{O}. Hence, the prediction accuracy over the entire genome, \mathbb{I} ⋃ \mathbb{O}, is derived.
The validity of the method is established with tests on the experimentally verified TIS set from EcoGene [12]. The method is then applied to estimate the TIS annotation accuracy for 532 genomes in public databases and publicly available programs, namely RefSeq [13], ProTISA [14], EasyGene [8, 15], GeneMarkS [7], Glimmer 3 [16] and TiCo [17]. Finally, this analysis has led to the construction of a new TIS database, SupTISA, whose estimated TIS annotation accuracy is much higher than that of RefSeq.
Methods
Basic definitions
Let us first introduce several definitions:

a blackboard bold symbol \mathbb{X} denotes a set of genes with specified STOP and TIS;

the sample size of \mathbb{X} is denoted by {\Omega}_{\mathbb{X}};

the symbol {A}_{\mathbb{X}} denotes the accuracy of \mathbb{X};

the symbol \mathbb{S} denotes the whole annotation set;

the symbol \mathbb{I} denotes the reference set whose TISs are supposed to be 100% accurate, and the symbol \mathbb{O} denotes its complement: \mathbb{S}=\mathbb{I}\cup \mathbb{O};

the symbol \mathbb{T} denotes the subset of \mathbb{O} with correct TIS annotation, and the symbol \mathbb{F} denotes its complement: \mathbb{O}=\mathbb{T}\cup \mathbb{F}. Thus the annotation accuracy of \mathbb{O} can be expressed as {A}_{\mathbb{O}}\triangleq {\Omega}_{\mathbb{T}}/{\Omega}_{\mathbb{O}}. Furthermore, the overall annotation accuracy is given by
{A}_{\mathbb{S}}\triangleq \frac{{\Omega}_{\mathbb{I}}}{{\Omega}_{\mathbb{S}}}+\frac{{\Omega}_{\mathbb{O}}}{{\Omega}_{\mathbb{S}}}{A}_{\mathbb{O}}. (1)
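As a quick numerical illustration of Eq. 1, the short Python sketch below combines the size of the reference set, the size of the remaining set and an estimated accuracy for the remaining set. The function name and the gene counts in the example are hypothetical and are not taken from the paper.

```python
def overall_accuracy(n_ref: int, n_other: int, acc_other: float) -> float:
    """Eq. 1: A_S = Omega_I/Omega_S + (Omega_O/Omega_S) * A_O.

    The reference set I is treated as 100% accurate, as in the text.
    """
    n_total = n_ref + n_other
    if n_total == 0:
        raise ValueError("empty annotation set")
    return n_ref / n_total + (n_other / n_total) * acc_other


# Hypothetical genome: 1500 reference genes, 2500 other genes at 70% accuracy.
print(round(overall_accuracy(1500, 2500, 0.70), 4))  # 0.8125
```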
Elementary patterns expressed with PWMs
The main task of this work is to estimate {A}_{\mathbb{O}}. The tool for this evaluation is the PWM of aligned sequences around TIS. We choose l bps upstream and r bps downstream of start codons (in this paper l = 50 and r = 15) to form a window of width l + r. The PWM for the set \mathbb{X} is denoted by {W}_{\mathbb{X}}; concretely, the frequency of nucleotide b at an aligned position j is denoted by W_{ j }(b), where b = 1 denotes adenine (A), b = 2 denotes cytosine (C), and so forth.
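A PWM of this kind can be computed by counting nucleotides column by column over the aligned windows. The sketch below is only illustrative: the function name is hypothetical, the base order "ACGT" is an assumption (the text fixes only b = 1 for A and b = 2 for C), positions are 0-based, and the toy windows are 4 bps wide rather than l + r = 65.

```python
from collections import Counter

BASES = "ACGT"  # assumed order; the text only states b = 1 (A), b = 2 (C)


def build_pwm(windows):
    """Return one {base: frequency} dict per aligned position."""
    if not windows:
        raise ValueError("no sequences given")
    width = len(windows[0])
    if any(len(w) != width for w in windows):
        raise ValueError("all windows must have the same width")
    pwm = []
    for j in range(width):
        counts = Counter(w[j] for w in windows)
        total = sum(counts[b] for b in BASES)  # ambiguous symbols are ignored
        if total == 0:
            raise ValueError(f"no A/C/G/T at position {j}")
        pwm.append({b: counts[b] / total for b in BASES})
    return pwm


# Toy alignment of four windows.
pwm = build_pwm(["AGGA", "AGGT", "ACGA", "TGGA"])
print(pwm[0]["A"], pwm[1]["G"])  # 0.75 0.75
```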
Three elementary PWMs are relevant to our analysis, corresponding to three types of TISs in the annotation. The first is the true TIS, and the corresponding PWM is denoted by {W}_{\mathbb{T}}. The second and third are two types of false TIS, whose PWMs are denoted by {W}_{{\mathbb{F}}_{u}} and {W}_{{\mathbb{F}}_{d}}, indicating a false TIS located upstream or downstream of the true TIS, respectively. Note that the overall PWM is, by definition, a linear combination of the PWMs of its subpatterns. As a consequence of this linearity, any number of subpatterns around true TISs can always be combined into a single {W}_{\mathbb{T}}, and the same holds for {W}_{{\mathbb{F}}_{u}} and {W}_{{\mathbb{F}}_{d}}. As long as the distribution of subpatterns is the same in the sets \mathbb{I} and \mathbb{O} (which is our homogeneity assumption, see below), it is justified to use the three elementary PWMs to represent an actual observed PWM such as {W}_{\mathbb{O}}.
The differences between the three types of PWMs are biologically clear. {W}_{\mathbb{T}} contains regulatory signals such as the SD sequence, which are required by the translation initiation machinery and are therefore expected to be conserved by evolution. In contrast, {W}_{{\mathbb{F}}_{u}} characterizes sequences exposed to neutral evolution and hence is generally featureless. Finally, a false TIS located downstream of the true TIS is surrounded by coding sequence, so {W}_{{\mathbb{F}}_{d}} exhibits period-three oscillations. In Figure 1, we show the three patterns obtained in our study for three different organisms. The features discussed above are generally present.
An annotation of finite accuracy gives rise to a PWM that is a linear combination of the above three PWMs. Specifically, for the set \mathbb{O}, we write:
{W}_{\mathbb{O}}={\alpha}_{\mathbb{T}}{W}_{\mathbb{T}}+{\alpha}_{{\mathbb{F}}_{u}}{W}_{{\mathbb{F}}_{u}}+{\alpha}_{{\mathbb{F}}_{d}}{W}_{{\mathbb{F}}_{d}}, (2)
where {\alpha}_{\mathbb{T}}+{\alpha}_{{\mathbb{F}}_{u}}+{\alpha}_{{\mathbb{F}}_{d}}=1. We will develop a least-squares estimator to determine the three coefficients α from the four observed PWMs, the first coming from the set \mathbb{O} and the other three from the set \mathbb{I}.
Reference set
The three elementary PWMs are obtained from the reference set, which is therefore central to this evaluation. The reference set needs to be as reliable as possible, and should not be biased towards any database/predictor to be evaluated. We have chosen to use the six most recent TIS databases/predictors, namely RefSeq [13], ProTISA [14], EasyGene [8, 15], GeneMarkS [7], Glimmer 3 [16] and TiCo [17], to derive the reference set. For any genome, we obtained the reference TIS set by intersecting the annotations of all six databases/predictors; to reduce false positives, genes shorter than 600 bps are excluded [18]. Among the six annotations, EasyGene, GeneMarkS, Glimmer 3 and TiCo achieve significant improvements in TIS prediction [7, 15–17], and ProTISA is compiled to contain more than 390,000 confirmed TISs with evidence collected from experiments, the literature, conserved domain searches, and sequence alignment between orthologous genes [14]. At present, annotations for all 532 genomes found in GenBank are available from every database/predictor except EasyGene, which provides annotations for only 157 genomes. For the remaining 381 genomes, we use only the other five annotations to derive the reference set. These reference sets represent the best TIS predictions achieved so far; they do not cover all genes (41.5 ± 9.5% in RefSeq) but are presumably very accurate. Our present work provides an evaluation of the rest of the predictions, i.e. those of the set \mathbb{O}.
The procedure to obtain the three PWMs from the reference set is as follows. Since the true TISs are known, the aligned sequences around the true TISs directly give rise to {\widehat{W}}_{\mathbb{T}} (an estimate of {W}_{\mathbb{T}}, see later). Similarly, one obtains {\widehat{W}}_{{\mathbb{F}}_{u}} and {\widehat{W}}_{{\mathbb{F}}_{d}} by aligning sequences around a randomly chosen false TIS upstream or downstream of the true TIS. Note that {\widehat{W}}_{{\mathbb{F}}_{u}} contains the fewest features of the three. Note also that in this procedure all three PWMs have the same sample size, {\Omega}_{\mathbb{I}}, so there will be finite-size effects to be accounted for in the analysis below.
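The text does not spell out how candidate false TISs are enumerated, so the following Python sketch rests on assumptions: a false TIS is taken to be any in-frame ATG, GTG or TTG, upstream candidates are collected until the first in-frame stop codon, and downstream candidates are collected inside the gene. Coordinates are 0-based on the forward strand, and the function name is hypothetical.

```python
START_CODONS = {"ATG", "GTG", "TTG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def false_tis_candidates(seq, true_tis, gene_end):
    """Return (upstream, downstream) lists of in-frame candidate start positions.

    seq      -- forward-strand DNA string
    true_tis -- 0-based index of the first base of the true start codon
    gene_end -- 0-based index one past the last base of the stop codon
    """
    upstream = []
    pos = true_tis - 3
    while pos >= 0:
        codon = seq[pos:pos + 3]
        if codon in STOP_CODONS:  # the open reading frame cannot extend further
            break
        if codon in START_CODONS:
            upstream.append(pos)
        pos -= 3
    downstream = [
        p for p in range(true_tis + 3, gene_end - 3, 3)
        if seq[p:p + 3] in START_CODONS
    ]
    return upstream, downstream


# Toy gene: TAA GTG CCC [ATG] AAA TTG GGG TAA, true TIS at index 9.
toy = "TAAGTGCCCATGAAATTGGGGTAA"
print(false_tis_candidates(toy, 9, len(toy)))  # ([3], [15])
```

One false TIS per gene and per direction would then be drawn at random from these lists to build the two false-TIS alignments.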
Homogeneity assumption
Finally, let us discuss the limitations of the homogeneity assumption. The sequence pattern encompasses regulatory signals that are important to the translation of genes. The homogeneity property is based on the idea that the translation mechanism is largely universal across a genome. There may be several translation mechanisms acting on a genome [2, 3, 5, 14]; in this case, the homogeneity assumption requires that the proportions of the subpatterns remain the same in the subsets of interest, namely the sets \mathbb{I} and \mathbb{O}. Proving this is difficult, and we do not attempt it in this work: \mathbb{O} might contain TISs whose statistical properties differ from those of the reference set, which would also make them harder to annotate correctly. The validity of our evaluation therefore depends on how large this effect is. A deviation from homogeneity is similar in effect to a finite accuracy of the set \mathbb{I}, which is easier to study. The testing results (see section "Testing") show that imperfection or bias in the set \mathbb{I} yields a definite but small modification of the evaluated accuracy. Therefore, it is reasonable to conclude that the homogeneity assumption is sound to leading order and that the results of our evaluation are credible.
Algorithm
Let {\widehat{W}}_{\mathbb{X}} be an estimate of {W}_{\mathbb{X}}. Because only a finite set of samples is available, Eq. 2 becomes
{\widehat{W}}_{\mathbb{O}}={\alpha}_{\mathbb{T}}{\widehat{W}}_{\mathbb{T}}+{\alpha}_{{\mathbb{F}}_{u}}{\widehat{W}}_{{\mathbb{F}}_{u}}+{\alpha}_{{\mathbb{F}}_{d}}{\widehat{W}}_{{\mathbb{F}}_{d}}+\epsilon, (3)
where ε depends on both {\Omega}_{\mathbb{O}} (for {W}_{\mathbb{O}}) and {\Omega}_{\mathbb{I}} (for the three elementary PWMs). Furthermore, to eliminate redundancy from the data, it is wise to apply a Z-transformation [19] from the matrix W of (l + r) × 4 dimensions to a matrix V of (l + r) × 3 dimensions:
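where j = 1, 2, ..., l + r. Consequently, we rewrite Eq. 3 as
{\widehat{V}}_{\mathbb{O}}={\alpha}_{\mathbb{T}}{\widehat{V}}_{\mathbb{T}}+{\alpha}_{{\mathbb{F}}_{u}}{\widehat{V}}_{{\mathbb{F}}_{u}}+{\alpha}_{{\mathbb{F}}_{d}}{\widehat{V}}_{{\mathbb{F}}_{d}}+{\epsilon}^{\prime}. (5)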
The nucleotide frequencies at different positions in all the PWMs are assumed to be independent [20]. This assumption is widely applied in gene finders [5, 7, 16], and deviations are expected to be small based on the results presented in the "Testing" section.
Together with the homogeneity assumption, we show that E(ε') = 0 and
where \mathbb{X} takes \mathbb{T}, {\mathbb{F}}_{u} and {\mathbb{F}}_{d}, respectively, and {{\Sigma}^{\prime}}_{\mathbb{X}} is a 3(l+r) × 3(l+r) covariance matrix calculated on the set \mathbb{X} whose components are inferred from {\widehat{W}}_{\mathbb{X}} (see Additional File 1).
The coefficients {\alpha}_{\mathbb{X}} can be estimated by generalized least squares, namely by minimizing the following weighted sum of squared errors t (see Additional File 1):
t={\left({\widehat{V}}_{\mathbb{O}}-{\sum }_{\mathbb{X}}{\alpha}_{\mathbb{X}}{\widehat{V}}_{\mathbb{X}}\right)}^{T}{{\Sigma}^{\prime}}^{-1}\left({\widehat{V}}_{\mathbb{O}}-{\sum }_{\mathbb{X}}{\alpha}_{\mathbb{X}}{\widehat{V}}_{\mathbb{X}}\right), (7)
where each \widehat{V} is read as a vector of length 3(l + r), α denotes the vector ({\alpha}_{\mathbb{T}}, {\alpha}_{{\mathbb{F}}_{u}}, {\alpha}_{{\mathbb{F}}_{d}})^{T} and Σ' denotes Var(ε') for simplicity. Because of Eq. 6, Σ' has a complicated dependence on α, and we need to solve a nonlinear optimization problem. This is done by an iterative procedure: an initial α is used to evaluate Σ', which is substituted into the linear equations of the optimization (obtained by setting the first-order partial derivatives of t with respect to α to zero) to calculate a new α. The new α is then used to update Σ', and the calculation repeats until α converges. As explained in detail in Additional File 1, the calculations converge quickly to the correct values. Throughout our tests, we did not encounter any instability in this calculation.
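The iteration described above can be sketched in a few lines of Python. This is a simplified stand-in rather than the estimator of Additional File 1: it works directly on flattened PWM frequencies instead of the Z-transformed matrices, and it assumes a diagonal covariance built from binomial sampling variances p(1 − p)/n. The function name and the toy vectors are hypothetical. The constraint that the three weights sum to one is imposed by substitution, which leaves a 2 × 2 linear system at each step.

```python
def estimate_weights(o, t, u, d, n_o, n_i, iters=50, tol=1e-9):
    """Iteratively reweighted least squares for o ~ aT*t + aU*u + aD*d, sum(a) = 1.

    o, t, u, d -- flattened PWM frequencies (equal-length lists of floats)
    n_o, n_i   -- sample sizes of the evaluated set and of the reference set
    Assumption: independent entries with binomial variance p(1 - p)/n.
    """
    a = [1 / 3, 1 / 3, 1 / 3]  # initial guess used to evaluate the variances
    for _ in range(iters):
        s11 = s12 = s22 = b1 = b2 = 0.0
        for ok, tk, uk, dk in zip(o, t, u, d):
            var = ok * (1 - ok) / n_o + (
                a[0] ** 2 * tk * (1 - tk)
                + a[1] ** 2 * uk * (1 - uk)
                + a[2] ** 2 * dk * (1 - dk)
            ) / n_i
            w = 1.0 / max(var, 1e-12)  # floor avoids division by zero
            x1, x2, y = tk - dk, uk - dk, ok - dk  # aD eliminated via sum(a) = 1
            s11 += w * x1 * x1
            s12 += w * x1 * x2
            s22 += w * x2 * x2
            b1 += w * x1 * y
            b2 += w * x2 * y
        det = s11 * s22 - s12 * s12
        if abs(det) < 1e-15:
            raise ValueError("elementary PWMs are not linearly independent")
        a_t = (b1 * s22 - b2 * s12) / det
        a_u = (s11 * b2 - s12 * b1) / det
        new = [a_t, a_u, 1.0 - a_t - a_u]
        if max(abs(p - q) for p, q in zip(new, a)) < tol:
            return new
        a = new
    return a


# Noise-free toy mixture with weights 0.7, 0.2, 0.1.
t = [0.9, 0.1, 0.5, 0.2]
u = [0.25, 0.25, 0.25, 0.25]
d = [0.4, 0.3, 0.1, 0.6]
o = [0.7 * x + 0.2 * y + 0.1 * z for x, y, z in zip(t, u, d)]
print([round(v, 6) for v in estimate_weights(o, t, u, d, n_o=1000, n_i=1000)])
```

Note that this sketch does not constrain the weights to lie between 0 and 1; with noisy data an estimate slightly outside that range is possible.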
Throughout the calculation, we face a question of how reliable the estimates of {W}_{\mathbb{T}}, {W}_{{\mathbb{F}}_{d}}, {W}_{{\mathbb{F}}_{u}} and {W}_{\mathbb{O}} are, given the finite sample of gene sequences used for the evaluation. This problem is addressed by adopting a bootstrapping strategy for finding a confidence interval (CI) of {\alpha}_{\mathbb{T}}. The calculation is repeated 200 times; each time, we randomly select, with replacement, a sample of TISs from the reference set of size {\Omega}_{\mathbb{I}} to calculate three elementary \widehat{W}s and a sample of TISs from the set \mathbb{O} of size {\Omega}_{\mathbb{O}} to calculate {\widehat{W}}_{\mathbb{O}}, and perform the optimization calculation described above. This calculation is carried out during the testing and every assessment. The consistency of the estimate is then judged by the uncertainty interval of the output accuracy.
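The resampling loop can be sketched as follows. The function name and the `estimator` callback are hypothetical: in practice the callback would rebuild the four PWMs from the resampled genes and return the estimated weight of the true-TIS PWM. The percentile interval used here is one common choice and is an assumption, since the text does not state how the CI is read off the 200 replicates.

```python
import random


def bootstrap_ci(ref_items, other_items, estimator, n_boot=200, level=0.95, seed=0):
    """Percentile bootstrap CI for estimator(ref_sample, other_sample)."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        # Resample both sets with replacement, keeping their original sizes.
        ref_sample = rng.choices(ref_items, k=len(ref_items))
        other_sample = rng.choices(other_items, k=len(other_items))
        estimates.append(estimator(ref_sample, other_sample))
    estimates.sort()
    tail = (1.0 - level) / 2.0
    lo = estimates[int(tail * n_boot)]
    hi = estimates[min(n_boot - 1, int((1.0 - tail) * n_boot))]
    return lo, hi


# Toy estimator: fraction of correct calls in the evaluated set (1 = correct).
calls = [1] * 60 + [0] * 40
lo, hi = bootstrap_ci(list(range(100)), calls, lambda ref, other: sum(other) / len(other))
print(lo, hi)
```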
Results
Testing
The experimentally confirmed TISs in EcoGene [12], denoted as EcoGene854, allow us to design a procedure to test the reliability of our method. The procedure is as follows. First, we randomly divide the genes in EcoGene854 into two equal-size sets \mathbb{O} and \mathbb{I}, and calculate the three elementary PWMs from the set \mathbb{I}, as explained above. Then, we create a series of partially accurate sets \mathbb{O} with accuracy α from 40% to 90% in steps of 10% by replacing 100(1 − α)% of the true TISs with randomly chosen false TISs. The aligned sequences around the newly assigned TISs of the set \mathbb{O} give rise to {W}_{\mathbb{O}}, which simulates the PWM of a real annotation of finite accuracy. The generalized least-squares calculation determines the estimated accuracy, \widehat{\alpha}. For each α, we repeat the generation of the sets \mathbb{I} and \mathbb{O} 200 times and obtain a distribution of \widehat{\alpha}, from which the average and standard deviation of \widehat{\alpha} are derived.
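The corruption step of this test can be sketched as below. The gene representation (a true TIS position paired with a list of false candidate positions) and the function name are assumptions made for illustration only.

```python
import random


def corrupt_annotation(genes, accuracy, seed=0):
    """Return one annotated TIS per gene so that roughly `accuracy` are correct.

    genes -- list of (true_tis, false_candidates) tuples
    """
    rng = random.Random(seed)
    # Only genes that have at least one false candidate can be corrupted.
    eligible = [i for i, (_, cands) in enumerate(genes) if cands]
    n_false = min(len(eligible), round((1.0 - accuracy) * len(genes)))
    to_corrupt = set(rng.sample(eligible, n_false))
    annotated = []
    for i, (true_tis, cands) in enumerate(genes):
        annotated.append(rng.choice(cands) if i in to_corrupt else true_tis)
    return annotated


# Ten toy genes, each with one upstream and one downstream false candidate.
genes = [(100 * i, [100 * i - 30, 100 * i + 45]) for i in range(10)]
annotated = corrupt_annotation(genes, accuracy=0.6)
print(sum(a == g[0] for a, g in zip(annotated, genes)))  # 6
```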
In Figure 2, we plot the average estimate and the standard deviation of \widehat{\alpha} as a function of true accuracy α. When the reference set is 100% accurate, \widehat{\alpha} fluctuates around α with ± 2.6%. We have also found that the estimate \widehat{\alpha} is unbiased.
An intriguing question is what happens if the reference set is not 100% accurate. This can be easily checked by carrying out a series of tests with varying accuracy of \mathbb{I} (by randomly replacing a portion of true TISs by false ones): {A}_{\mathbb{I}} = 90%, 95% and 100%. The results are also shown in Figure 2. Generally speaking, an overestimation of the accuracy is obtained. This is readily understood because when {A}_{\mathbb{I}} < 100%, {W}_{\mathbb{T}} contains contribution from false TIS, and hence a bias is generated in favor of false annotation, and the estimated accuracy is higher. Specifically, we found that the estimated accuracy is about \widehat{\alpha}=\alpha /{A}_{\mathbb{I}} if the reference set has an accuracy of {A}_{\mathbb{I}}. This effect is consistent in both \mathbb{I} and \mathbb{O}. As for the set \mathbb{I}, the real accuracy is {A}_{\mathbb{I}} while the estimated accuracy is {A}_{\mathbb{I}}/{A}_{\mathbb{I}} = 100%. So, the estimation is inversely proportional to the accuracy of the reference set. This dependence on {A}_{\mathbb{I}} is very helpful to keep in mind when one interprets an actual assessment.
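The empirical relation \widehat{\alpha}=\alpha /{A}_{\mathbb{I}} can be inverted to see how much an estimate would shrink if the reference set were less than perfect. The snippet below is only a worked illustration; the function name and the 95% reference accuracy are hypothetical inputs.

```python
def corrected_accuracy(estimated: float, ref_accuracy: float) -> float:
    """Invert alpha_hat = alpha / A_I to obtain alpha = alpha_hat * A_I."""
    if not 0.0 < ref_accuracy <= 1.0:
        raise ValueError("reference accuracy must be in (0, 1]")
    return estimated * ref_accuracy


# An estimate of 80.6% with a (hypothetical) 95%-accurate reference set.
print(round(corrected_accuracy(0.806, 0.95), 4))  # 0.7657
```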
The tests designed above provide a unique opportunity to check whether a bootstrapping strategy offers useful knowledge about the uncertainty of the estimate. We carried out a bootstrapping calculation for the runs with {A}_{\mathbb{O}} = 60% and {A}_{\mathbb{I}} = 100%; the widths of the obtained 95% CIs are shown in Figure 3 as a function of {\Omega}_{\mathbb{I}} (obtained by taking only a subset of genes from the set \mathbb{I}). This dependence can be compared directly with the real assessment calculations. As shown, the width of the 95% CI follows approximately a power-law dependence on {\Omega}_{\mathbb{I}}, which is a result of the nonlinear optimization. At the largest size of \mathbb{I}, the width is around 13%, which is about 30% wider than that derived from the actual distribution of \widehat{\alpha} (around 10.0% for an approximately normal distribution with standard deviation 2.6%, as above). In other words, the bootstrapping calculation overestimates the scatter of the estimated {\alpha}_{\mathbb{T}}, and hence provides a good and conservative measure of the reliability of \widehat{\alpha}. When we extend the 95% CI results of the testing to those of the real assessment with the actual {\Omega}_{\mathbb{I}} in E. coli, we find that they agree remarkably well. This confirms the validity of the bootstrapping calculation. Generally speaking, Figure 3 shows that, for typical genomes with {\Omega}_{\mathbb{I}} ~1000 – 2000, the assessed accuracy has a 95% CI width of 5% to 8% (equivalently ± 1% to ± 2% in standard deviation). This is a rather satisfactory outcome.
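The comparison between the bootstrap width and the width implied by the observed scatter is simple arithmetic under a normal approximation, where a 95% interval spans about ± 1.96 standard deviations. The helper name below is hypothetical.

```python
def ci_width_from_sd(sd: float, z: float = 1.96) -> float:
    """Full width of a two-sided normal interval: 2 * z * sd."""
    return 2.0 * z * sd


width_normal = ci_width_from_sd(2.6)  # sd of 2.6% from the testing runs
print(round(width_normal, 1))         # 10.2
print(round(13.0 / width_normal, 2))  # 1.28, i.e. roughly 30% wider
```

The same relation links a CI width of 5% to 8% to a standard deviation of roughly 1.3% to 2.0%.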
Applications
Assessing RefSeq annotation
RefSeq is the most widely used public database for TIS annotation, and its accuracy is the main concern of this study. We have conducted an overall assessment of the TIS annotation in RefSeq. A total of 532 genomes are assessed. The annotation accuracy varies widely, from 3.3% in A. baumannii ATCC 17978 to 96.8% in P. pentosaceus ATCC 25745, with an average of 80.6 ± 9.9%. About 40% of the genomes have accuracies higher than 85.0%, including genomes from several well-studied genera such as Bacillus, Escherichia, Salmonella and Pseudomonas. In contrast, 13.5% of the genomes, most of which are GC-rich, have very suspicious TIS annotations with accuracies lower than 70%. A complete list of estimated accuracies for the 532 genomes is available in Additional File 2.
Below, we examine two annotation preferences that potentially affect the RefSeq annotation quality, namely the tendencies to over-annotate the LORF and to under-annotate the ATG start codon.
As reported previously [7, 8], RefSeq tends to over-annotate the LORF. If a TIS annotation follows the LORF rule (i.e., always taking the 5'-most start codon), then its TIS accuracy equals the percentage of true TISs that coincide with the LORF start (referred to below as the percentage of true LORF). Our method provides a way to estimate this percentage. For a genome for which a reliable reference set can be generated, we create an artificial annotation by adopting the LORF rule; the estimated accuracy of this artificial annotation is the percentage of true LORF. Applied to Y. pestis, this gives an estimated percentage of true LORF of 63.7%, whereas the actual percentage of LORF in the RefSeq annotation of Y. pestis is 92.6%. We therefore judge that there is about 30% over-annotation of LORF in this genome. The same study is carried out for all 532 genomes, and the results are shown in Figure 4, where we find an average of 7.6 ± 9.1% over-annotation of LORFs in RefSeq.
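The over-annotation figure is the difference between the share of LORF starts in the annotation and the estimated share of true LORF. A minimal sketch with the Y. pestis numbers from the text (the function name is hypothetical):

```python
def lorf_overannotation(annotated_lorf_pct: float, true_lorf_pct: float) -> float:
    """Percentage points by which the LORF is over-represented in an annotation."""
    return annotated_lorf_pct - true_lorf_pct


# Y. pestis: 92.6% LORF starts in RefSeq versus an estimated 63.7% true LORF.
print(round(lorf_overannotation(92.6, 63.7), 1))  # 28.9
```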
Another preference is the under-annotation of the ATG start codon, for which we have developed statistical measures to provide further quantitative evidence. We have conducted the calculation within genera, a genus being the taxonomic category ranking below family but above species. It is reasonable to expect that the TISs of species from the same genus show little difference in statistics such as start codon usage. A total of 29 genera containing at least five selected genomes are studied, and the Escherichia genus is chosen to present our results; the reported observations hold for most of the other genera (see Additional File 3). As shown in Figure 5, the percentage of annotated ATG starts in the CFT073 strain is about 70%, whereas this percentage in the well-studied K12 strain reaches a much higher value of 90%. Note that the percentage calculated from the confirmed EcoGene854 data set is about 91%. Our estimated accuracy of TIS annotation for the CFT073 strain is below 70%, significantly lower than that for the K12 strain (about 94%). Figure 5 shows a clear linear correlation between ATG start codon usage and accuracy for all strains in the Escherichia genus.
Assessing other TIS annotations
Since our reference set is constructed from the intersection of all relevant TIS databases/predictors, it is not biased towards any one of them, and hence we can analyze the accuracy of all of the predictors for the 532 genomes. This subsection discusses their performance. We chose RefSeq as the baseline for comparing accuracies. To reduce false positives, genes not annotated by RefSeq and genes shorter than 300 bps were excluded, as implied in [8, 18]. Figure 6 shows the accuracy difference between each of the five other TIS predictors and RefSeq.
As two of the most popular gene finders, Glimmer and GeneMark have been used to annotate hundreds of genomes. The most recent versions, Glimmer 3 and GeneMarkS, include an RBS model to predict TISs, in the form of a PWM whose parameters are derived by a Gibbs sampler. High performance is reported on two well-studied genomes, E. coli K12 and B. subtilis [7, 16]. When assessed here on the 532 genomes by our method, Glimmer 3 and GeneMarkS show average accuracies of 83.6 ± 7.4% and 85.0 ± 7.3%, respectively, which are higher than that of RefSeq (80.6 ± 9.9%). However, for a considerable number of genomes, Glimmer 3 and GeneMarkS make limited or no improvement over RefSeq (see Figure 6).
EasyGene has published annotations for only 157 genomes [21] and is believed to be a better TIS resource than RefSeq [8]. Indeed, our assessment confirms that, for its reported set of genomes, EasyGene's annotation has a noticeably high average accuracy of 86.7 ± 6.3%, which is about 10% higher than the RefSeq prediction for the same set (see Figure 6). Note that EasyGene is reported to make conservative choices in gene prediction [8, 15], and it often excludes 5–10% of the RefSeq genes, some of which have functional annotation.
Unlike gene finders, TiCo is a post-processor of an existing annotation. High performance was reported on E. coli K12 and B. subtilis, as well as on GC-rich genomes such as P. aeruginosa PAO1 [17]. As shown in Figure 6, the improvement over RefSeq is indeed remarkable, with an average accuracy improvement of 10% over all 532 genomes. Note, however, that the accuracy improvement is relatively lower in GC-rich and AT-rich genomes (in the wings of Figure 6).
ProTISA is a recently published database dedicated to TIS annotation in prokaryotic genomes. It is generated by collecting various confirmed TISs and predictions from MED-Start (upgraded), which post-processes the RefSeq annotation [14, 22, 23]. The assessment carried out here indicates that, over 532 genomes, ProTISA has a mean accuracy of 90.5%, which is 9.9% higher than RefSeq (see Figure 6). As a more rigorous comparison, we applied the paired-samples t-test to judge whether ProTISA gives a significantly higher accuracy than RefSeq, and obtained a t-value of 31.2, much larger than the critical value of 1.7 at 95% confidence with 531 degrees of freedom. Thus, the answer is definitely positive. In addition, there are 101 genomes for which ProTISA's prediction accuracy is higher than RefSeq's by 15%; for these genomes, RefSeq shows a clear preference for over-annotating the LORF and under-annotating the ATG start codon (data not shown). Besides accurate TIS annotation, ProTISA annotates potential regulatory signals, which are helpful in investigating the diversity of translation initiation mechanisms. For example, besides the SD signal, a Pribnow box is found about 10 bps upstream of the TIS in many bacterial genomes, suggesting that leaderless genes may not be rare in bacteria [14].
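For readers who wish to repeat this kind of comparison, a paired-samples t statistic can be computed from per-genome accuracies with the Python standard library alone. The function name and the four accuracy pairs below are hypothetical toy values, not data from the paper.

```python
import math
import statistics


def paired_t(a, b):
    """Return (t, degrees of freedom) for the paired differences a[i] - b[i]."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    mean_diff = statistics.mean(diffs)
    std_err = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return mean_diff / std_err, len(diffs) - 1


# Toy accuracies for four genomes under two annotations.
t_value, dof = paired_t([0.90, 0.92, 0.88, 0.95], [0.80, 0.85, 0.80, 0.90])
print(round(t_value, 2), dof)  # 7.21 3
```

With 532 genomes the test has 531 degrees of freedom, as in the text.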
A new TIS database: SupTISA
The method of evaluation proposed in this paper is based on a fundamentally different principle: the homogeneity of the PWMs of any subset of a genome, expressed as a linear combination of three elementary PWMs. This principle is based on the universal process of gene translation, and it is a macroscopic property of the ensemble of TISs. This information is supplementary to the properties used by TIS predictors, and hence can (and should) be used as a complementary way of improving global annotation performance. In other words, we propose to construct a new TIS annotation database by selecting the best TIS predictor's annotation for each genome; the resulting annotations form a new database (of 532 genomes at present), named SupTISA [24]. This is possible because the assessment is independent of the individual predictors and unbiased towards any of them.
Specifically, for each genome, SupTISA selects whichever of RefSeq, ProTISA, EasyGene, GeneMarkS, Glimmer 3 and TiCo has the highest estimated accuracy as the SupTISA annotation, and makes it available for download at the web address [24]. SupTISA takes advantage of the complementary strengths of the existing TIS predictors. For instance, ProTISA is generally better than TiCo on genomes with biased GC content, but TiCo outperforms ProTISA on others (data not shown). As a result, SupTISA achieves an average annotation accuracy of 92.1 ± 4.7%. Figure 6 shows that SupTISA generally outperforms RefSeq by 5% to 30%. By this assessment, SupTISA is currently the most accurate of the resources compared here for experimental use and for computational studies related to TIS.
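The selection rule amounts to taking, for each genome, the predictor with the highest estimated accuracy. A minimal Python sketch follows; the function name, the genome labels and the accuracy values are all hypothetical.

```python
def select_best(accuracy_by_genome):
    """Map each genome to (best predictor, its estimated accuracy).

    Ties are resolved in favour of the predictor listed first.
    """
    best = {}
    for genome, scores in accuracy_by_genome.items():
        predictor = max(scores, key=scores.get)
        best[genome] = (predictor, scores[predictor])
    return best


# Made-up accuracies for two genomes.
estimates = {
    "genome_A": {"RefSeq": 0.81, "ProTISA": 0.93, "TiCo": 0.90},
    "genome_B": {"RefSeq": 0.78, "ProTISA": 0.88, "TiCo": 0.91},
}
print(select_best(estimates))
```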
Discussions and Conclusion
Translation is a fundamental process for an organism, and the regulatory signals relevant to this process should have a relatively uniform distribution across a genome. A PWM of aligned sequences around TISs summarizes the statistical information of the signal, and can therefore be used to study how much the true signal contributes to a given set of annotations. This is the principle we use to devise, for the first time, an algorithm for large-scale evaluation of TIS prediction accuracy. The tests with confirmed genes and the assessment of six databases/predictors over 532 genomes give a series of consistent results. Although the actual accuracy values may be subject to a few percent of uncertainty, due to statistical fluctuations from finite sample sizes and possible distortion of the reference sets, the assessments appear to be a valid leading-order measure of the quality of TIS annotations. Such an assessment is meaningful, especially when the estimated accuracy is low: typically, this indicates that some unjustified or simplified assumptions were used during the annotation. Our assessment then provides a tool that helps experimental and computational biologists avoid being misled by an oversimplified annotation. We have shown that the RefSeq annotations for some genomes are of this nature.
Correct annotation is important to both in vivo and in silico studies of translation initiation. In P. horikoshii OT3 and several other archaeal genomes, Cang and Wang [25] reported a high frequency of ATG triplets 9 bp downstream of annotated TISs, using GenBank's TIS annotation data. It was suggested that "a remedial initiation site for archaea ... reflect the decreased efficiency of the translation initiation machinery in archaea". However, when a refined dataset of TIS annotations from the present database was used, this unexpected over-representation disappeared (data not shown). It is therefore likely that the observed excess of ATG triplets is due to an excess of false TISs located upstream of the true TISs in the annotation used in that study. As an additional outcome, the present assessment yields a new database, called SupTISA. The interest of SupTISA lies in the fact that no single TIS predictor is able to correctly take into account all properties of sequences around TISs over the entire family of prokaryotic genomes; SupTISA can therefore integrate the predictors by acting as a macro-selector among them. This is possible because SupTISA is based on a macroscopic principle (holding over all genomes): the homogeneity of the translation machinery.
References
Shine J, Dalgarno L: The 3'-terminal sequence of E. coli 16S RNA: complementarity to nonsense triplets and ribosome binding sites. Proc Natl Acad Sci U S A 1974, 71: 1342–1346. 10.1073/pnas.71.4.1342
Londei P: Evolution of translational initiation: new insights from the archaea. FEMS Microbiol Rev 2005, 29: 185–200. 10.1016/j.femsre.2004.10.002
Torarinsson E, Klenk HP, Garrett RA: Divergent transcriptional and translational signals in Archaea. Environ Microbiol 2005, 7: 47–54. 10.1111/j.1462-2920.2004.00674.x
Gold L: Posttranscriptional regulatory mechanisms in Escherichia coli. Annu Rev Biochem 1988, 57: 199–233. 10.1146/annurev.bi.57.070188.001215
Zhu HQ, Hu GQ, Yang YF, Wang J, She ZS: MED: a new non-supervised gene prediction algorithm for bacterial and archaeal genomes. BMC Bioinformatics 2007, 8: 97. 10.1186/1471-2105-8-97
Poole FL, Gerwe BA, Hopkins RC, Schut GJ, Weinberg MV, Jenney FEJ, Adams MW: Defining genes in the genome of the hyperthermophilic Archaeon Pyrococcus furiosus: implications for all microbial genomes. J Bacteriol 2005, 187: 7325–7332. 10.1128/JB.187.21.7325-7332.2005
Besemer J, Lomsadze A, Borodovsky M: GeneMarkS: a self-training method for prediction of gene starts in microbial genomes. Implications for finding sequence motifs in regulatory regions. Nucleic Acids Res 2001, 29: 2607–2618. 10.1093/nar/29.12.2607
Nielsen P, Krogh A: Large-scale prokaryotic gene prediction and comparison to genome annotation. Bioinformatics 2005, 21: 4322–4329. 10.1093/bioinformatics/bti701
Starmer J, Stomp A, Vouk M, Bitzer D: Predicting Shine-Dalgarno sequence locations exposes genome annotation errors. PLoS Comput Biol 2006, 2: e57. 10.1371/journal.pcbi.0020057
Frishman D, Mironov A, Gelfand M: Starts of bacterial genes: estimating the reliability of computer predictions. Gene 1999, 234: 257–265. 10.1016/S0378-1119(99)00200-0
Gorodkin J, Heyer LJ, Brunak S, Stormo GD: Displaying the information contents of structural RNA alignments: the structure logos. Comput Appl Biosci 1997, 13: 583–586.
Rudd KE: EcoGene: a genome sequence database for Escherichia coli K-12. Nucleic Acids Res 2000, 28: 60–64. 10.1093/nar/28.1.60
Pruitt K, Tatusova T, Maglott D: NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res 2007, 35: D61–D65. 10.1093/nar/gkl842
Hu GQ, Zheng XB, Yang YF, Ortet P, She ZS, Zhu HQ: ProTISA: a comprehensive resource for translation initiation site annotation in prokaryotic genome. Nucleic Acids Res 2008, 36: D114–D119. 10.1093/nar/gkm799
Larsen TS, Krogh A: EasyGene – a prokaryotic gene finder that ranks ORFs by statistical significance. BMC Bioinformatics 2003, 4: 21. 10.1186/1471-2105-4-21
Delcher AL, Bratke KA, Powers EC, Salzberg SL: Identifying bacterial genes and endosymbiont DNA with Glimmer. Bioinformatics 2007, 23: 673–679. 10.1093/bioinformatics/btm009
Tech M, Meinicke P: An unsupervised classification scheme for improving predictions of prokaryotic TIS. BMC Bioinformatics 2006, 7: 121. 10.1186/1471-2105-7-121
Skovgaard M, Jensen LJ, Brunak S, Ussery D, Krogh A: On the total number of genes and their length distribution in complete microbial genomes. Trends Genet 2001, 17: 425–428. 10.1016/S0168-9525(01)02372-1
Zhang CT, Zhang R: Analysis of distribution of bases in the coding sequences by a diagrammatic technique. Nucleic Acids Res 1991, 19: 6313–6317. 10.1093/nar/19.22.6313
Staden R: Computer methods to locate signals in nucleic acid sequences. Nucleic Acids Res 1984, 12: 505–519. 10.1093/nar/12.1Part2.505
Zhu HQ, Hu GQ, Ouyang ZQ, Wang J, She ZS: Accuracy improvement for identifying translation initiation sites in microbial genomes. Bioinformatics 2004, 20: 3308–3317. 10.1093/bioinformatics/bth390
ProTISA [http://mech.ctb.pku.edu.cn/protisa/]
Cang XH, Wang J: A unique ATG triplet downstream of gene start in archaea: implications for translation initiation and evolution. Gene 2004, 327: 75–79. 10.1016/j.gene.2003.11.001
Acknowledgements
We thank Yongchu Liu, Xiaojing Yang, Yifan Yang and Xinqiu Yao for helpful discussions. The work was partially supported by the National Natural Science Foundation of China (10225210 and 30770499) and by the 973 Project grant 2003CB715905.
Authors' contributions
ZSS and GQH conceived the study, designed the applications and drafted the manuscript. ZSS and HQZ co-supervised the development of the work. XBZ and GQH designed and implemented the algorithm. LNJ performed part of the tests. All authors read and approved the final manuscript.
Gang-Qing Hu and Xiaobin Zheng contributed equally to this work.
Electronic supplementary material
12859_2007_2145_MOESM2_ESM.xls
Additional file 2: Estimated TIS annotation accuracies of six selected databases/predictors. Accuracies of TIS annotation on a total of 532 genomes for RefSeq, Glimmer 3, GeneMarkS, EasyGene, TiCo and ProTISA. (XLS 173 KB)
12859_2007_2145_MOESM3_ESM.pdf
Additional file 3: Correlation between annotation accuracy and ATG start codon usage. A total of 29 genera were selected. The linear fit was applied if the Pearson Correlation is significant at 95% confidence. (PDF 116 KB)
Rights and permissions
Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Hu, GQ., Zheng, X., Ju, LN. et al. Computational evaluation of TIS annotation for prokaryotic genomes. BMC Bioinformatics 9, 160 (2008). https://doi.org/10.1186/1471-2105-9-160
DOI: https://doi.org/10.1186/1471-2105-9-160
Keywords
 Prokaryotic Genome
 Translation Initiation Site
 Sequence Logo
 Homogeneity Assumption
 Annotation Accuracy