Prodigal: prokaryotic gene recognition and translation initiation site identification
© Hyatt et al. 2010
Received: 20 July 2009
Accepted: 8 March 2010
Published: 8 March 2010
Skip to main content
© Hyatt et al. 2010
Received: 20 July 2009
Accepted: 8 March 2010
Published: 8 March 2010
The quality of automated gene prediction in microbial organisms has improved steadily over the past decade, but there is still room for improvement. Increasing the number of correct identifications, both of genes and of the translation initiation sites for each gene, and reducing the overall number of false positives, are all desirable goals.
With our years of experience in manually curating genomes for the Joint Genome Institute, we developed a new gene prediction algorithm called Prodigal (PROkaryotic DYnamic programming Gene-finding ALgorithm). With Prodigal, we focused specifically on the three goals of improved gene structure prediction, improved translation initiation site recognition, and reduced false positives. We compared the results of Prodigal to existing gene-finding methods to demonstrate that it met each of these objectives.
We built a fast, lightweight, open source gene prediction program called Prodigal http://compbio.ornl.gov/prodigal/. Prodigal achieved good results compared to existing methods, and we believe it will be a valuable asset to automated microbial annotation pipelines.
Microbial gene prediction is a well studied, and some would say solved, problem, but the truth is that there is still much room for improvement, especially in understanding how translation initiation mechanisms work in prokaryotes. Existing methods for bacterial and archaeal gene prediction include the popular Glimmer  and GenemarkHMM  packages, both of which are included at NCBI alongside Genbank  annotations (Prodigal is also included), as well as other methods such as Easygene  and MED .
Current gene recognition methods perform relatively well in low GC content genomes, but the accuracy drops considerably in high GC content genomes. High GC genomes contain fewer overall stop codons and more spurious open reading frames (ORFs). These false ORFs are often selected by programs instead of real ORFs in the same genomic region. In addition, the longer ORFs in high GC genomes contain more potential start codons, thus leading to a drop in accuracy of the translation initiation site (or TIS) predictions as well.
Translation initiation site prediction in existing microbial gene-finding tools has not proven to be sufficiently adequate, and this has motivated a number of tools to be developed specifically to correct the start calls of current methods. These tools include GSFinder , TiCO , and TriTISA . It is our view that a single gene prediction algorithm should be able to match the performance of the above methods, rather than needing to run two programs to attain the desired level of accuracy in start predictions.
Finally, most methods tend to predict too many genes. Although many of the short genes predicted by current programs that have no existing BLAST  hits might be real, the likelihood is that most are false positives. We base this assertion on the fact that genome-wide proteomics studies that search the entire set of all potential ORFs do not identify a significant number of peptides in these genes . In the construction of a novel algorithm, we determined it would be preferable to sacrifice some genuine predictions if it meant also eliminating a much larger number of false identifications.
With the advent of faster sequencing technologies, it is likely that in the future less time will be spent on finishing microbial genome sequence. It is also likely that researchers will not often be able to curate manually the gene predictions delivered by automated pipelines. It is therefore important to improve the current methodologies to obtain higher quality gene predictions, better translation initiation site predictions, and a reduction in the number of false positives.
To address these challenges, we constructed a novel gene-finding algorithm called Prodigal. In designing the Prodigal algorithm, we decided to use a "trial and error" approach. We began by building a set of curated genomes that had been analyzed using the JGI ORNL pipeline http://genome.ornl.gov/. This pipeline consisted of a combination of Critica  and Glimmer , BLAST  to locate missing genes and correct errors, and a final round of manual expert curation. To this initial set of ten genomes we added Escherichia coli K12 (both the Genbank file and the Ecogene Verified Protein Starts data set ), Bacillus subtilis, and Pseudomonas aeruginosa. With these sets in hand, it became possible to validate or exclude changes to the algorithm based on whether or not the performance on the test set of genes increased or decreased, respectively. In the final stages of validating the rules in the program, we expanded this set to include over 100 genomes from Genbank.
It should be noted that we only used this set to determine very general rules about the nature of prokaryotic genes, such as gene size, maximum overlap between two genes (both on the same strand and on opposite strands), and RBS motif usage. In addition, we tuned several constants in the program based on performance on this data set. This set was also used to exclude ideas that caused deterioration in performance across many genomes. (These failed ideas are too numerous to include in this publication). Because we intended to validate Prodigal's performance by examining E. coli, B. subtilis, and P. aeruginosa, we also verified that each of these decisions we made also maximized performance on the remaining genomes in our set. Changes were not retained if they were merely "local" improvements to a subset of genomes, especially not genomes on which we intended to test the program's performance.
In order for Prodigal to run in a completely unsupervised fashion, it needed to be able to learn all the necessary properties of the input organism, including start codon usage (ATG vs. GTG vs. TTG), ribosomal binding site (RBS) motif usage, GC frame plot bias, hexamer coding statistics, and other information necessary to build a complete training profile. To gather statistics from a finished sequence or set of sequences, the algorithm first had to determine automatically a set of putative "real" genes on which to train.
Prodigal constructs its training set of genes by examining the GC frame plot in the ORFs in the genome. The program begins by traversing the entire input sequence and examining the bias for G's and C's in each of the three codon positions in each open reading frame. The highest GC content codon position for an ORF is considered the "winner", and a running sum for that codon position is incremented. Once all ORFs have been processed, the sums give an approximate measure of the preference of each codon position for G and C. The values for each codon position are normalized around 1 and divided by 1/3. If 2/3 of the codons in ORFs prefer G or C in the third position, for example, then the bias score for that position would be 2. We tried converting this bias to a log score, but this was found to decrease the quality of the results.
where B(i) is the bias score for codon position i, and l(i) is the number of bases in the gene where the 120 bp maximal window at that position corresponds to codon position i.
With this preliminary coding score measure based on simple GC codon position statistics, Prodigal scores every start-stop pair above 90 bp in the entire genome. (We tried allowing genes smaller than this, but the number of false positives became problematic.) Prodigal then performs a dynamic programming  across the whole sequence (or set of sequences) to identify a maximal "tiling path" of genes to train on. The purpose of this dynamic programming method is to force the program to choose between two heavily overlapping ORFs in the same genomic context. In theory, one of these ORFs should match the preferred GC codon position of the organism, whereas the other one should not.
Dynamic Programming Connections in Prodigal
Score of 2nd gene
Score of 2nd gene
Opposite Strand Overlap
Score of 2nd gene
where C is the coding score, G is the percentage occurrence of that word within our gene training set, and B is the percentage occurrence of that word across the entire sequence (irrespective of frames). So, for example, if a word is twice as likely to occur in a gene as it is in the background, the score for that word would be log(2). This corresponds to a 5th-order Markov model [1, 2]. A floor and a ceiling are also established on this score to handle cases where there is insufficient data for a given word.
where S is the sum of the coding scores (C) for the in-frame hexamers (the set of words w) in the gene. In addition, Prodigal modifies this coding score based on information about what lies upstream of the selected start. For example, if a gene 1000..3000 has a score of 500.0, and the gene 1200..3000 has a score of 400.0, Prodigal modifies the score of the second gene to be 400-(500-400) = 300. The reason for this modification is to penalize choosing a truncated version of a gene when a longer, higher-scoring version of the same gene could also be chosen. In the dynamic programming model, this can be thought of as penalizing a connection to an interior start by subtracting the difference between the two potential genes. The purpose for this modification is to discourage the truncation of genes through choosing a gene on the opposite strand that overlaps with and erases the beginning of the longer version of the gene, a common occurrence in current gene-finders. In addition, Prodigal implements a few more minor tweaks to the coding score, including boosting the score of particularly long genes (dependent on the GC content of the organism: ~700 bp or so in low GC, ~1200 or so in high GC) to be minimally positive if the preliminary coding score is negative.
Once Prodigal has calculated coding potential scores for every start-stop pair in the genome, the next step is to create a translation initiation site scoring system from the training set. The program constructs a background of ATG, GTG, and TTG frequencies off all start nodes in the genome. It also builds a background of RBS motifs based on the Shine-Dalgarno sequence . Unlike many methods, which use a position-weight matrix or Gibbs sampling method to find motifs, Prodigal begins by assuming that the SD motif will be utilized by the organism. If this turns out not to be the case, it runs a more rigorous motif finder. But, to start with, the program attempts to determine if the SD motif is widely utilized by the genome in question.
Shine-Dalgarno RBS Motifs in Prodigal
GGA, GAG, AGG
GGA, GAG, AGG, AGxAG, GGxGG
AGGA, GGAG, GAGG, AGxAGG, AGGxGG
GGA, GAG, AGG
AGGAG, GGAGG, AGGAGG
AGGA, GGAG, GAGG
AGGA, GGAG, GAGG
GGA, GAG, AGG
where S is the score, R is the observed percentage of this type in our training set, and B is the percentage occurrence in the background. This method is used both for start codon usage (ATG, GTG, or TTG) as well as for the SD bin motif (from the table above). These scores are summed together and multiplied by a constant (4.25, corresponding to about 16 bp of coding score, determined empirically from maximal performance on our test set of genomes, and later verified on a larger set of genomes from Genbank), then added to the coding score. Prodigal goes through every start-stop node and performs this calculation, modifying the default coding score by the quality of its start codon information. This leads to a new set of "peaks" for the set of training ORFs. For example, an ATG with a slightly lower coding score than a TTG in the same ORF could overtake it with the additional start score added (assuming the organism uses ATG as a start codon more than TTG).
Once a new set of peaks has been determined, Prodigal reconstructs the background for both SD motif and start codon usage. In this iteration and in subsequent ones, it no longer assumes a higher numbered bin is better for RBS motifs, and it instead relies on the log likelihoods calculated in the previous iteration to find the best upstream motif for a given start site. Prodigal performs several iterations of this process, moving the peaks around based on subsequent information until they no longer move significantly. When the peaks no longer move, it determines the final set of weights based on statistics gathered from this final set of putative "real" start codons.
The end result is a set of log-likelihood weights for ATG/GTG/TTG information and for each of the above RBS bins. If the zero bin for RBS motifs, which corresponds to no SD motif, is positive, or if the zero bin is above -0.5 and the 4-base motif bins are less than 1.0, then Prodigal determines that this organism does not use the SD motif strongly, and it runs a more rigorous motif finder. In examining over 800 finished genomes in Genbank, we determined about 10% of them did not use the SD motif strongly. Most of these genomes were cyanobacteria, chlorobii, or archaea, which seem to use different translation mechanisms than the more common SD motif.
If it is found that the organism does not use the SD motif, Prodigal searches exhaustively for alternative motifs. It does so by looking at the occurrence of all 3-mer motifs in the initial set of peaks, and locating all 3-mers that occur in at least 20% of the high-scoring gene models. From these motifs, it then performs an iterative algorithm similar to the above. The bins instead correspond to every word of size 3-6 bp (mismatches allowed only in the center of 5-6 bp words, just as in the SD RBS motif table above) with every potential spacer size (3-4 bp, 5-10 bp, 11-12 bp, and 13-15 bp). All words 3-6 bp that do not occur frequently enough are combined in the "no RBS motif" bin. Prodigal then arrives at a similar set of weights for no RBS motif, as well as for each 3-6 bp motif that contains the commonly occurring 3 bp motif as a subset. In Aeropyrum pernix, a strong GGTG motif is located, whereas in many cyanobacteria, Prodigal latches onto AT-rich motifs like TATA and TAAA.
Finally, we added a scoring system to capture information in the regions outside those examined by the RBS scorer (1-2 bp and 15 bp to 45 bp upstream from the translation start site). This scoring system builds a position weight matrix on the whole region. Although this scoring system is very crude and captures only general characteristics (AT-richness, simple base preferences, etc.), it was found to be quite effective in some genomes. This generic upstream scoring system is not part of the iterative algorithm; the data is instead gathered from the final iteration of the start training.
in which S is the final score, R is the RBS motif score, T is the start type score, U is the upstream score, and C is the coding score. For the RBS weight, Prodigal uses the SD motif score if it determines that the organism uses Shine-Dalgarno, the secondary RBS motif score if it finds a clear-cut secondary motif, and the maximum of the two systems if neither system located a strong RBS motif. This latter method was shown to work well in some genomes such as cyanobacteria and crenarchaea that tended to have AT-rich upstream regions but still occasionally used the SD motif for some genes (such as ribosomal proteins).
A linear combination of the various elements was the first method we tried, and it worked well enough that we did not pursue other strategies. It may be that there exists a better method of integrating the different signals (perhaps a neural network or some other classifier), but this will have to be examined in future versions. The 4.25 and 0.4 constants were arrived at by experimenting with different values and observing the change in results across our test set of genomes. We chose the values such that they maximized performance across the entire set. In order to rule out bias in E. coli, B. subtilis, and P. aeruginosa, we also verified that the same approximate constants maximized performance on our set of genomes with those three excluded.
False positive reduction is an important goal in Prodigal. In order to reduce the number of overall predictions, Prodigal modifies the above start weight (4.25) based on the length of the gene. In examination of numerous genomes, we determined that approximately 250 bp is the point of equilibrium at which a gene with a positive coding score is equally likely to be a false positive or a true prediction. Genes less than 250 bp are therefore penalized according to their length divided by 250. If the start score is greater than 0, it is reduced to l/250*s, where l is the length of the gene. If the start score is less than 0, it is instead multiplied by 250/l*s. Finally, for all genes with negative coding scores, regardless of length, the start score is penalized by a small amount to prevent genes with moderately good start scores but bad coding scores from drifting above zero.
Once the scores have been calculated, the dynamic programming is performed a second time, using the more detailed node scores described above for the gene connections. For intergenic connections, operon distance provides a stronger weight in the second pass of dynamic programming. When two genes overlap by 1 or 4 bp, if the second gene lacks an RBS and has a negative RBS score, the requirement of an RBS is lifted and the score is increased to 0. In addition, the program adds small bonuses for distances less than 60 bp, and small penalties for distances greater than 180 bp. These distances correspond roughly to observed operon distances . Although dynamic programming has order n log n, we limit the valid connections by distance, such that "long" connections can only be made between the start of a really long gene and its stop codon. The end result is that Prodigal must make a connection generally within 5 kb, so that it must choose a gene in this region, even if its score is negative. When the dynamic programming is complete, however, the program makes a final sweep through the models and removes any such genes with negative scores. In addition, the algorithm makes one final improvement to start calls that proved to be significant in our test set. When two starts are separated by a distance of less than 15 bp (determined empirically from our test set), Prodigal sets the coding of the two choices to be equivalent and uses only the start score (based on RBS motif and start type) to determine which start to choose for the final gene prediction.
Prodigal runs very quickly, analyzing a 4 MB genome in about 20 seconds on a typical workstation. It is also extremely easy to use relative to other methods, consisting of only a single executable that can be run without the user needing to supply any organism-specific parameters. A web server has also been implemented at http://compbio.ornl.gov/prodigal/. The latest source code for Prodigal is available via the same web site, and version 1.20 has been included as an additional file [Additional File 1].
Assessing the performance of microbial gene-finding programs remains a difficult task due to the lack of experimentally verified gene start sets. The EcoGene Verified Protein Starts set  remains the only large set of experimentally verified genes and translation start sites for typical bacteria. In addition, another large set was produced for two archaea, Halobacterium salinarum and Natronomonas pharaonis, using a special proteomics technique for extracting N-terminal peptides . In addition to the above sets, numerous smaller sets exist for other genomes, but most of these are also atypical genomes such as cyanobacteria and archaea. Nonetheless, these still provide a set of experimentally verified genes with which to test the accuracy of start site and gene predictions. For these genomes, we relied on the data set from the ProTisa database of confirmed translation initiation sites . However, some of the genes in the Synechocystis set were inconsistent with annotations in the Genbank files, and subsequent manual inspection proved the Genbank files to be correct. Therefore, we removed genes from this set, as well as a few genes from the other genomes, that disagreed with the Genbank annotations. We extracted all sets with more than 50 experimentally determined translation initiation sites, although we excluded numerous relatives of E. coli (and other genomes) in ProTisa whose starts were verified only through similarity search.
Gene Prediction Performance
Escherichia coli K12
Mycobacterium tuberculosis H37Rv
As can be seen in Table 3, Prodigal proved equal or better at locating genes in every organism with a few exceptions: Glimmer 3  and EasyGene  in P. aeruginosa, and GenemarkHMM  in N. pharaonis. Prodigal also performed equal to or better than the other tools in translation initiation site prediction with a few exceptions: GenemarkHMM  and TriTisa  on B. subtilis, and TiCo , TriTisa , and EasyGene  on Haemophilus influenzae. Prodigal performs equal or better at locating existing genes, while also providing comparable performance in translation initiation site prediction to the start correction tools.
The above test set contains many unusual genomes. Bacillus subtilis is remarkable for its extremely strong use of the SD motif. Halobacterium salinarum and Natronomonas pharanois make only limited use of the SD motif. Aeropyrum pernix contains a different RBS motif in GGTG. Synechocystis PCC6803, like most cyanobacteria, does not seem to use the SD motif at all, and instead favors AT-rich regions upstream of its translation start sites. A quick scan of 700 finished genomes in Genbank with Prodigal's SD-motif routine revealed that 88% of them used the SD motif. It is our assertion that E. coli, B. subtilis, and P. aeruginosa from the table above provide a more typical look at performance in the vast majority of sequenced microbial genomes.
Comparison with Genbank Annotations
Genbank Genes with no Joins
Escherichia coli K12
These results cannot be seen as definitive, however, as it is always possible Prodigal's algorithm contains a bias that is shared by whatever methods were used to create the original Genbank files. In order to rule this bias out, we examined ATG usage and "leftmost start" usage in each of the methods, but we could find no obvious bias shared by Prodigal and Genbank annotations relative to the other methods. If anything, Prodigal seemed to call more starts internally and to truncate more genes, than the other methods. Although the quality of the Genbank files is impossible to estimate, we included the above results to demonstrate the concept of "genome-wide" performance, an important factor in microbial annotation pipelines.
Number of Genes Predicted By Each Method
Escherichia coli K12
EasyGene  predicts fewer genes than all other methods on every genome, with the one exception of Pseudomonas aeruginosa; however, Easygene  is also less sensitive than the other programs (as can be seen in tables 3 and 4). It is likely the program could be improved on these genomes simply by using a less stringent R-value threshold, though this would lead to an increase in the number of genes predicted. Prodigal predicts equal or fewer genes vs. the remaining methods (excepting EasyGene) in all cases except Halobacterium salinarum vs. Genemark , while still retaining excellent sensitivity in locating genes. The gaps in B. subtilis and Synechocystis are particularly noticeable.
In the future, we hope to improve Prodigal's recognition of short genes, atypical genes, translation initiation mechanisms, and genomes. With a more detailed look at cyanobacteria and archaea, in general, it should be possible to build a better start site prediction algorithm than the one currently in place for non-SD motifs. Also, identifying laterally transferred genes, genes in phage regions, proteins with signal peptides, and any other genes that do not match the typical GC frame bias for the organism in question, are areas where Prodigal can improve. We will also seek to develop a version of Prodigal to address the rapidly growing metagenomic data for microbial organisms.
We developed a new gene-finding program for microbial genomes called Prodigal. The goals of Prodigal were to attain greater sensitivity in identifying existing genes, to predict translation initiation sites more accurately, and to minimize the number of false positive predictions. The results of Prodigal were compared to existing methods for both purely experimentally verified genes as well as curated Genbank files for a number of genomes. Prodigal's performance was found to be comparable or better to existing methods in the prediction of genes while also predicting fewer overall genes. In the prediction of translation initiation sites, Prodigal performed competitively with existing methods. Prodigal is currently already in use at many institutions, and it has been used to annotate all finished microbial genomes submitted to Genbank by DOE-JGI in 2008 and onward (a substantial percentage of the overall finished microbial genomes at NCBI). It is run regularly at NCBI alongside GenemarkHMM  and Glimmer , and it has also been incorporated into the Swiss Institute of Bioinformatics microbial genomics browser . In conclusion, Prodigal should prove to be a valuable resource for genome annotation of either draft or finished microbial sequence.
Project Name: Prodigal
Project Home Page: http://compbio.ornl.gov/prodigal/
Operating System: Any
Programming Language: C
License: GNU GPL
The authors wish to acknowledge Robert W. Cottingham, Edward C. Uberbacher, Cynthia Jeffries, and Yun-Juan Chang for helpful discussions and suggestions. The authors also wish to acknowledge Mark Borodovsky for helpful discussions and the provision of the Genemark software. Support for this research was provided by the U.S. Department of Energy, Office of Science, Biological and Environmental Research Programs. Oak Ridge National Laboratory is managed by UT-Battelle, LLC, for the U.S. Department of Energy under contract DE-AC05-00OR22725.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.