Gene finding in novel genomes

Background Computational gene prediction continues to be an important problem, especially for genomes with little experimental data. Results I introduce the SNAP gene finder which has been designed to be easily adaptable to a variety of genomes. In novel genomes without an appropriate gene finder, I demonstrate that employing a foreign gene finder can produce highly inaccurate results, and that the most compatible parameters may not come from the nearest phylogenetic neighbor. I find that foreign gene finders are more usefully employed to bootstrap parameter estimation and that the resulting parameters can be highly accurate. Conclusion Since gene prediction is sensitive to species-specific parameters, every genome needs a dedicated gene finder.


Background
Complete genomic sequences are becoming more and more abundant. Given a new genome, one of the first and most important tasks is determining the structure of its protein-coding genes. Ab initio gene prediction plays a critical role because it produces gene structures quickly, inexpensively, and in some cases reliably. The accuracy of a gene finder depends on many factors. Chief among these is proper training. Training a gene finder can be a laborious task. In the past, genome projects moved rather slowly, and the trickle of data meant that parameters could be estimated before the genome was complete. Today, sequencing and assembly are so rapid that genomes can appear well in advance of proper training material. This leaves some new genome annotation projects without an appropriate gene finder.
When new genomic sequence emerges, it must be annotated with something, and frequently this is a gene finder for a completely different genome. A particularly good example of this is Genscan [1], which was trained primarily for the human genome, but has been used to annotate genes in worms, flies, fish, fungi, amphioxus, and others [2][3][4][5][6]. Since sequence features such as codon bias [7] and splicing signals [8] vary from organism to organism, it is expected that gene finders may not perform optimally in a foreign genome. While the practice of annotating a genome with a foreign gene finder is commonplace, its consequences are not widely understood.
In this paper, I introduce a new ab initio gene finding program called SNAP. SNAP is similar to Genscan and other generalized hidden Markov model (HMM) gene finders [9][10][11][12][13], but unlike many, it is easily adaptable to a number of organisms and its source code is freely available. I train and evaluate SNAP in the Arabidopsis thaliana (thale cress), Caenorhabditis elegans (nematode), Drosophila melanogaster (fruit fly), and Oryza sativa (rice) genomes, and demonstrate that SNAP is an accurate ab initio gene finder. For newly sequenced genomes without a specific gene finder, I show that annotating with a foreign gene finder may be highly inaccurate and I introduce a bootstrapping procedure that allows one to estimate species-specific parameters de novo.

Results and discussion
SNAP is a high-performance ab initio gene finder I trained and evaluated SNAP in four genomes (see Methods) and compared its performance to Genscan in all genomes, to HMMGene [12] and Genefinder [14] in C. elegans, and to Augustus [15] in D. melanogaster. Genscan performs as well as recent gene finders designed specifically for Arabidopsis [16], was considered one of the standards for the Drosophila GASP experiments [3], and is one of the gene finders used by the International Rice Genome Sequencing Project [17]. HMMGene and Genefinder are well-established gene prediction programs for C. elegans. Augustus is one of the latest gene prediction programs and has been shown to outperform Genscan, GENIE, and GENEID in Drosophila.
As shown in table 2, SNAP is more accurate than Genscan in every genome. In C. elegans, SNAP performs better than HMMGene and almost as well as Genefinder. In D. melanogaster, SNAP is similar to Augustus. The HMMs employed by SNAP in this study have a minimal genome model without a promoter, poly-A signal, or UTRs, and the reason why SNAP outperforms Genscan is simply that it is trained for each genome. When compared to gene finders tuned for a particular genome, SNAP performs about equally. It may be possible to increase the accuracy of SNAP by including more states in the HMM to model additional genomic features or by using more sophisticated statistical techniques such as interpolated Markov models, maximum dependence decomposition trees or isochore segmentation. Fine-tuning SNAP to particular genomes is not the subject of this study however.

Gene prediction in novel genomes can be highly inaccurate
A newly sequenced genome may not have much training material and little experimental data to anchor gene predictions. How can one find genes in such uncharted territory? The common procedure is to use a gene finder from some other genome, perhaps the one that is most phylogenetically similar. To determine the consequences of this practice I evaluated the difference between the intra-and inter-species performance of SNAP. The results are displayed in table 3.  Gene prediction accuracy with foreign parameters appears to follow GC content more than phylogenetic relationships. For example, Oryza parameters perform reasonably well in Drosophila sequence (>25% genes correct) but very poorly in A. thaliana (5% genes correct). Similarly, for finding C. elegans genes, one is better off with parameters from A. thaliana than D. melanogaster. Choosing the best foreign gene finder is therefore not simply a matter of using parameters from the closest relative.

Genomes have significant compositional differences
To look more closely at the reasons behind the inaccuracy of foreign parameters, I have examined compositional differences in coding sequence, splice sites, and the translation start. Figure 2 displays the codon frequencies of degenerate codons for each of the four genomes. In general, codon preference is reflected by GC composition. That is, the GC-rich genomes prefer G and C in the 3rd position and the AT-rich genomes prefer A or T. This helps to explain the results of the previous experiment. But even between genomes with similar GC-content there are significant differences among equivalent codons. For example, Oryza prefers CTC for Leucine while Drosophila prefers CTG.
Pictograms [18] for splice sites and the translation start are shown in figure 3. The two plant genomes (Arabidopsis and Oryza) have very similar acceptor sites, but the two animal genomes (Caenorhabditis and Drosophila) each have unique features. The upstream region from -7 to -27, frequently known as the poly-pyrimidine tract is T-rich in plants, AT-rich in Caenorhabditis, and pyrimidine-rich (proximally) in Drosophila. T is the most common nucleotide at positions at -5 and -6 in all genomes, but in Caenorhabditis these are almost invariant. At -4, the plants prefer G while Caenorhabditis prefers T and Drosophila is unbiased. The splice donor sites appear to be similar among all the genomes. However, there may be speciesspecific higher-order contexts that contribute to some specificity since the tuned architectures (see Methods) are slightly different for each genome. The translation start site appears to have some genome specificity. Given the compositional differences in the various signals, it is not surprising that gene prediction with foreign gene finders can be highly inaccurate.

Parameter estimation in novel genomes
Even though foreign gene finders may perform sub-optimally, their predictions may display compositional properties of the novel genome. For example, when annotating the A. thaliana genome with a C. elegans gene finder, the predictions appeared very much like real A. thaliana genes. Figure 3d shows a splice acceptor pictogram derived from these predictions. Note that the sequence composition broadly resembles true A. thaliana splice acceptors, including a preference for G at -3 and a T-rich upstream sequence. It also retains some C. elegans qualities such as a greater proportion of Ts at -5 and -6.
If predicted genes have compositional properties similar to real genes, it should be possible to train a gene finder for a novel genome in the absence of any data. One simply runs a foreign gene finder (or more than one) to create a virtual training set and estimates parameters for the novel genome from the virtual data. So rather than use foreign gene finders to identify genes, one uses them to bootstrap parameter estimation. To determine if this procedure works, I assumed one of the genomes was novel and eval- In O. sativa, bootstrapped parameters are only somewhat helpful. The Arabidopsis and Caenorhabditis parameters improve in both sensitivity and specificity, but the Drosophila foreign parameters are actually better than any of the bootstrapped ones. The reason why O. sativa behaves differently from the others is unclear at this time. In general, however, estimating parameters from gene predictions appears to be a simple and convenient way to train a gene finder for a novel genome. It is important to note that the genomes studied in this paper are all relatively compact. The techniques here may not work well in mammalian genomes or other genomes where exons account for a small fraction of the total sequence. I am currently investigating methods to improve gene finding accuracy and to quickly estimate gene prediction parameters in large genomes.

Conclusions
In this paper, I demonstrated that SNAP is a high performance gene finding program in several genomes. For new genomes without an appropriate gene finder, I showed SNAP HMM state diagram

Data sets
Genomic data and annotation were retrieved from two sources. The genomic sequence and annotation were subjected to a battery of tests to select for complete genes with no obvious errors such as overlapping exons, exons out of bounds, exons on the wrong strand, in-frame stop codons, or introns <30 bp. In addition to these "sanity checks", every gene was confirmed by an end-to-end, gapfree alignment between the in silico predicted transcript and a full-length cDNA. This requirement ensures that there are no extra/missed exons and that predicted splice sites are consistent with the actual transcript. Genes with unusual features such as non-canonical splice sites, internal stops, or split start/stop codons were removed. To simplify evaluation, genes with known alternate forms or genes overlapping other genes have been removed. To limit redundancy, all encoded proteins within a genome d were aligned to each other, and in cases where there was a BLASTP [22,23] alignment of >90% identity, one of the sequences was removed. The version of BLAST used was 2.0MP-WashU [10-Sep-2003] and the parameters were the defaults plus -wordmask = seg + xnu. Bioperl [24] libraries were used for processing BLAST reports. For the A. thaliana set there were a large number of genes that passed all criteria, so the final set was trimmed by random selection to make it similar to the others.

Codon frequency
The amount of sequence upstream and downstream of each gene is variable; the distance is half way to the nearest annotated gene up to a maximum of 1 kb. The average intergenic length in the four genomes is 2.9 kb, 2.7 kb, 6.6 kb, and 4.1 kb for Arabidopsis, Ceanorhabditis, Drosophila, and Oryza. 1 kb was chosen because at longer lengths some unannotated genes were clearly present (data not shown). The GC content, proportion of single-exon genes, and average exon and intron lengths are all similar to the downloaded data for each genome (data not shown).
The gene sets used in this study are summarized in table 1. The Arabidopsis and Caenorhabditis genomes are very ATrich while the Drosophila and Oryza genomes have a more balanced nucleotide composition. Some genomes have unique features when compared to the others: Caenorhab-ditis has very few single-exon genes, Drosophila has much longer exons and introns, and Arabidopsis has the shortest introns on average.

Gene model
SNAP models protein coding sequences in genomic DNA via a specialized hidden Markov model similar to the one used in Genscan [1]. There are a few key differences: 1. Genscan is described as having three intron states but SNAP uses six to prevent stop codons at splice junctions (Genscan does not predict genes with stop codons, but its method to prevent stop codons is not described).
2. Genscan models both strands simultaneously while SNAP treats each strand independently. Decoding the strands independently allows genes on opposite strands to overlap. The advantage this is that it allows genes within introns of other genes, which is relatively common in some genomes. The disadvantage is that it also allows overlapping exons, which is not common.
3. The state diagram is fixed in Genscan but is read from a parameter file in SNAP. This allows one to change the HMM to describe a variety of genomic features. The sim- plest state diagram for predicting single-and multi-exon genes is shown in figure 1.
4. The sequence feature model architectures in Genscan are fixed but SNAP allows one to employ any length weight matrix and any order Markov model and to embed these models within an array, decision tree, or 3-periodic (coding sequence) framework.
5. Introns may have explicit length distributions over a fixed distance followed by a geometric tail. Using explicit lengths increases the computational load by an amount that depends on the length of the explicit distribution and on exon density. The total runtime is approximately 1.5 times as long for these genomes with a fixed distance of 250 bp.

Parameter estimation
In gene prediction algorithms it is common to employ weight matrices (WM), weight array matrices (WAM), and Markov models (MM) to model compositional features such as the translation start site, splice sites and codon bias [1]. The choice of architectures is partly limited by the amount of data in the training set, but is ultimately determined by what works best in practice. To find a useful combination of models I began with simple weight matrices for the splice acceptor, splice donor, translation start, and translation stop sites, and 4 th order Markov models for coding, intron, and intergenic sequence. I did not include models for promoter, poly-A site, UTRs or trans-splicing. These features are often not annotated and I therefore used only the features which could be unambiguously defined in the training data.
The "default" models are as follows: the acceptor WM is 30 bp long with 3 exonic nucleotides, the donor WM is 9 bp with 3 exonic nucleotides, the start WM is 12 bp with 6 coding nucleotides, and the stop WM is 9 bp with 3 bp on either side of the stop codons. The WMs (except the stop which is not shown) are therefore accurately represented by the pictograms in figure 3. These default architectures were used in the gene prediction experiments corresponding to table 4. For the experiments corresponding to tables 2 and 3, the architectures were tuned to find those that work best in each organism. The number of possible combinations of architectures is very large and was not fully explored. Tuned architectures differ from the default by changing the length of the WMs (exonic portions are kept constant) or by adding context to each position (which changes a WM to a WAM). The changes from the aforementioned default architectures are as follows: In A. thaliana, the acceptor is a 20 bp 1 st order WAM, the donor is a 9 bp 1 st order WAM, and the start is a 12 bp 1 st order WAM. In C. elegans, the acceptor is a 15 bp WM, the donor is a 9 bp 2 nd order WAM, and the start is a 18 bp 2 nd order WAM. In D. melanogaster, the acceptor is a 40 bp 2 nd order WAM, the donor is a 15 bp 2 nd order WAM, and the start is a 18 bp 1 st order WAM. In O. sativa, the acceptor is a 40 bp WM. In addition, the intron models in A. thaliana and C. elegans used explicit durations for a length of 250 bp.

Performance evaluation
All intra-species performance figures were calculated using a 5-fold cross-validation strategy (non-overlapping sets were trained on 4/5 of the data, gene prediction was evaluated on 1/5 of the data, and the performance figures were averaged over the 5 sets). Inter-species performance did not employ 5-fold cross-validation since the training and testing sets are from different genomes. Prior to gene prediction, all sequences were masked with RepeatMasker [25] with appropriate libraries using the MaskerAid [26] enhancement. The repeat library for O. sativa was obtained from TIGR [19]. Other repeat libraries were part of the RepeatMasker distribution.
Genscan was run with the A. thaliana parameter file for A. thaliana sequences and the human parameter file for the others. Genefinder and HMMGene were run with C. elegans parameters. Augustus was run with the "fly_singlestrand_partial.cfg" parameter file. Sensitivity [true positives/(true positives + false negatives)] and specificity [true positives/(true positives + false positives)] were calculated at the nucleotide, exon, and gene levels. Gene sensitivity (fraction of genes predicted exactly correct) is also reported as gene accuracy.

SNAP implementation
SNAP is written in ANSI C as a command line Unix program. The source code is covered by the GNU General Public License and is freely available from the author. SNAP is a relatively efficient program in both memory and time; it requires about 100 megabytes of memory and 30 cpu seconds to decode a 1 Mbp sequence on a 1 Ghz machine. CPU time and memory scale linearly with sequence length.