Comparison of computational methods for identifying translation initiation sites in EST data
© Nadershahi et al 2004
Received: 13 August 2003
Accepted: 16 February 2004
Published: 16 February 2004
Expressed Sequence Tag (EST) sequences are generally single-strand, single-pass sequences only 200–600 nucleotides long; they contain errors resulting in frame shifts and represent different parts of their parent cDNA. If the cDNAs contain translation initiation sites, they may be suitable for functional genomics studies. We have compared five methods to predict translation initiation sites in EST data: first-ATG, ESTScan, Diogenes, NetStart, and ATGpr.
A dataset of 100 EST sequences, 50 with and 50 without translation initiation sites, was created. Based on analysis of this dataset, ATGpr is found to be the most accurate for predicting the presence versus absence of translation initiation sites. With a maximum accuracy of 76%, ATGpr more accurately predicts the position or absence of translation initiation sites than NetStart (57%) or Diogenes (50%). ATGpr similarly excels when start sites are known to be present (90%), whereas NetStart achieves only 60% overall accuracy. As a baseline for comparison, choosing the first ATG correctly identifies the translation initiation site in 74% of the sequences. ESTScan and Diogenes, consistent with their intended use, are able to identify open reading frames, but are unable to determine the precise position of translation initiation sites.
ATGpr demonstrates high sensitivity, specificity, and overall accuracy in identifying start sites while also rejecting incomplete sequences. A database of EST sequences suitable for validating programs for translation initiation site prediction is now available. These tools and materials may open an avenue for future improvements in start site prediction and EST analysis.
Complete sequences of the mouse and human genomes are available; completion of additional animal genomes is imminent. Effective methods for identifying genes, and the proteins they encode, have become increasingly important. Although most genes can be identified through the open reading frame (ORF) of the protein they encode, detection in eukaryotic genomic sequence is more difficult since these genes are fragmented into small exons (averaging 145 bp in human), extending across large regions (averaging 27 kb in human).
Eukaryotic gene-discovery can be most effectively accomplished through direct sequencing of gene transcripts using cDNA libraries. Because cDNAs represent processed mRNAs, intervening sequences have been removed, and ORFs can more easily be deduced. Due to cost and time constraints, most high-throughput cDNA sequencing efforts rely on end-sequences from cDNA clones that vary in length, and thus represent different portions of the mRNAs from which they derive.
These end sequences, called expressed sequence tags (ESTs), are generally single-strand, single-pass sequences only 200–600 nucleotides long; they contain errors leading to frame shifts and represent different parts of the parent cDNA. Comparison of ESTs to each other, and to genome sequence, is useful for gene discovery. Comparison of ESTs from different cDNA libraries may yield information about gene expression and alternative mRNA processing. Furthermore, ESTs can be used as 'tags' to identify genes and to probe the genome for matching sequences, such as in the construction of genome maps. As a result of their usefulness, large numbers of ESTs have been generated in both the public and private sectors; in 2001, ESTs made up more than 60% of all of the nucleotide sequence database entries.
ESTs also provide a resource for determining the complexity and quality of cDNA libraries, including identifying full-length cDNA clones suitable for isolation and functional analysis. A full-length cDNA should encompass all sequences from the CAP site to the poly (A) addition site. However, a cDNA comprising at least the entire ORF, from translation initiation site (TIS) to termination codon, is worthy of high accuracy re-sequencing and/or protein functional analysis. In fact, successful identification of the TIS alone leads to simple determination of the termination codon, if present. For this reason, most methods for determining the completeness of ESTs, and by extension the cDNAs from which they originate, focus on the TIS. This study reviews and compares – both qualitatively and quantitatively – the major computational methods and tools for identifying TISs and determining completeness of ESTs.
The majority of eukaryotic mRNAs have one open reading frame and a single functional TIS, usually the AUG codon closest to the 5'-end. The "scanning hypothesis" postulates that a 40S ribosomal subunit binds initially at the 5'-end of an mRNA and migrates linearly in a 3' direction until it reaches the first AUG codon [6–8]. If the first initiation codon lies in a suitable context (e.g., GCC[AG]CCatgG, Kozak's consensus) the 40S ribosomal subunit migrates no further, is joined by the 60S ribosomal subunit, and the complex initiates protein synthesis [5, 6]. When the context is less than favourable, some protein synthesis may occur there, but most will start at the next downstream AUG codon.
Though Kozak's consensus has very good validity in vertebrate mRNAs, further analysis has revealed variation in the initiation context between different groups of eukaryotes. Furthermore, despite the utility of Kozak's consensus in identification of TISs in mRNAs, EST data pose numerous problems that render the consensus sequence much less useful for them. The main problem involves the generality of the consensus sequence; while the absence of the pattern will usually exclude an ATG from being the initiation codon, the pattern is general enough to match many other ATG triplets in each sequence. In the case of an incomplete EST lacking the true initiation site, relying solely on Kozak's consensus would falsely predict the most-5' match to the consensus as the initiation site. Additional features are required to identify TISs in ESTs, such as the positioning of a Kozak's consensus sequence relative to a significant open reading frame.
Several computational tools have been developed to assist in this identification. Some methods, such as conditional probability matrices, consider only the nucleotides in the vicinity of ATGs. Other methods, such as NetStart, consider larger regions. ATGpr considers a variety of factors. Still others, such as ESTScan and Diogenes, though not specifically designed to identify TISs, perform very well in identifying open reading frames and might be expected to be useful for predicting EST completeness.
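As a minimal illustration of matching the consensus quoted above, the following Python sketch reports every ATG that sits inside an exact GCC[AG]CCatgG match. The function name `kozak_atg_positions` is hypothetical, and requiring an exact match is an assumption made for simplicity; practical tools typically score partial matches rather than demanding the full pattern.

```python
import re

# Kozak's consensus as quoted in the text: GCC[AG]CC followed by ATG and G.
# A lookahead is used so that overlapping matches are all reported.
KOZAK = re.compile(r"(?=GCC[AG]CC(ATG)G)")

def kozak_atg_positions(seq):
    """Return 0-based positions of each ATG embedded in an exact consensus match."""
    # Accept mRNA (AUG) or cDNA (ATG) spelling, in either case.
    seq = seq.upper().replace("U", "T")
    return [m.start(1) for m in KOZAK.finditer(seq)]

# Example: the ATG at index 8 lies in a perfect consensus context.
print(kozak_atg_positions("TTGCCACCATGGCG"))
```

A sequence with an ATG but no surrounding consensus returns an empty list, which mirrors the point made above that absence of the pattern is more informative than its presence.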
Table 1. Availability of the methods (surviving entries): available for download; locally written using Microsoft Excel.
Kozak, in 1989, reported that fewer than 10% of all eukaryotic mRNAs do not use the first ATG as the start codon. If this remains true, it should be possible to predict the TIS with 90% accuracy by simply selecting the first (most-5') ATG. However, this is only true for complete, error-free mRNA sequences. The situation is very different with ESTs, which, as mentioned above, are partial, single-pass cDNA sequences. ESTs have more errors than genomic sequences and may represent different regions of the mRNA – in some cases lacking the true TIS. For these reasons, prediction of the TIS in an EST may benefit from consideration of TIS context. However, evaluating the simple method of choosing the first ATG can reveal the extent of the above problems. Furthermore, the first-ATG method serves as a meaningful baseline against which to compare more sophisticated methods.
Several programs distinguish between coding sequences and non-coding sequences based solely on the intrinsic properties of the nucleotide sequences, as opposed to using homology information. The most successful programs are GenScan for genomic DNA and ESTScan for ESTs. ESTScan is of particular interest for this study because of its potential to determine completeness of ESTs. ESTScan implements a fifth-order hidden Markov model that recognizes coding sequences by oligonucleotide frequencies. Additionally, ESTScan corrects for sequence errors, which could be an especially helpful feature for analyzing ESTs. Although ESTScan does not incorporate a model of the TIS, it does predict the beginning of the coding sequence. This prediction may not be very accurate – indeed, it may not even correspond to an ATG – but ESTScan's detection of coding sequences makes this program potentially useful for evaluating EST completeness. An updated version is available.
Diogenes, developed at the University of Minnesota, is somewhat similar in purpose to ESTScan; it finds ORFs in short sequences. Diogenes identifies ORF candidates by scanning all six reading frames for stretches of sequence uninterrupted by stop codons. Various organism-specific statistical measures, such as codon frequency and ORF length, are then used to estimate the likelihood that these ORF candidates encode proteins. A quadratic discriminant statistic combining these various factors is reported as an overall score for the reliability of the final ORF prediction. Like ESTScan, Diogenes does not incorporate a model of the TIS. However, Diogenes also reports the predicted beginning of the coding sequence, which may be useful for evaluating EST completeness.
NetStart, perhaps the most popular and accessible program for TIS prediction, analyzes a larger region – up to 100 bases upstream and 100 bases downstream of a putative start codon. NetStart uses an artificial neural network to predict the initiation site from this large fixed-length window around the potential start codon. Based on a training data set of conceptually-spliced mRNA derived from genomic sequences with known start sites, the neural network 'learned' on its own which features are indicative of a true TIS. This approach is especially appealing due to the complexity of translation initiation.
ATGpr considers as many as six characteristics of the EST sequence in analyzing the context of a putative TIS:
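The first-ATG baseline is simple enough to state in a few lines. The sketch below is a Python equivalent written for illustration only (the study itself used spreadsheet functions); the function name `first_atg` is hypothetical.

```python
def first_atg(seq):
    """Return the 0-based index of the most-5' ATG, or None if there is none."""
    idx = seq.upper().find("ATG")
    return idx if idx >= 0 else None

# The baseline always predicts the first ATG, regardless of its context.
print(first_atg("ccgATGaaaATGttt"))
```

Note that the baseline predicts a TIS for any sequence containing at least one ATG, so by construction it cannot reject a 5'-truncated EST that happens to contain an internal ATG.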
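To make the idea of recognizing coding sequence by oligonucleotide frequencies concrete, the following Python sketch scores a sequence with two fifth-order Markov chains, one estimated from coding and one from non-coding training sequences. It is a deliberately reduced illustration and not ESTScan's model: it ignores reading frame, hidden states, and error correction, and the function names and the pseudocount are assumptions.

```python
import math
from collections import Counter

def train_fifth_order(seqs, pseudocount=1.0):
    """Estimate P(next base | previous 5 bases) from training sequences."""
    hexamers, contexts = Counter(), Counter()
    for s in seqs:
        s = s.upper()
        for i in range(len(s) - 5):
            hexamers[s[i:i + 6]] += 1   # 5-base context plus the next base
            contexts[s[i:i + 5]] += 1   # the 5-base context alone
    def prob(context, base):
        # Additive smoothing over the four bases (pseudocount is an assumption).
        return (hexamers[context + base] + pseudocount) / (contexts[context] + 4 * pseudocount)
    return prob

def log_odds(seq, coding_prob, noncoding_prob):
    """Sum of log-likelihood ratios; positive values favour the coding model."""
    seq = seq.upper()
    return sum(
        math.log(coding_prob(seq[i - 5:i], seq[i]) / noncoding_prob(seq[i - 5:i], seq[i]))
        for i in range(5, len(seq))
    )

# Toy training data, purely illustrative.
coding = train_fifth_order(["ATGGCC" * 10])
noncoding = train_fifth_order(["ATATAT" * 10])
print(log_odds("ATGGCCATGGCC", coding, noncoding) > 0)
```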
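The first step described above, scanning all six reading frames for stretches uninterrupted by stop codons, can be sketched as follows. This is an illustrative Python reconstruction of that step only, not Diogenes itself; the function names, the standard stop codon set, and the `min_codons` parameter are assumptions, and the statistical scoring of candidates is omitted.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def stop_free_stretches(seq, min_codons=1):
    """Yield (strand, frame, start, end) for each stop-free run of codons.

    start and end are 0-based and end-exclusive, measured on the strand
    that was scanned ("+" is the input, "-" its reverse complement).
    """
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            run_start = frame
            for i in range(frame, len(s) - 2, 3):
                if s[i:i + 3] in STOPS:
                    if (i - run_start) // 3 >= min_codons:
                        yield (strand, frame, run_start, i)
                    run_start = i + 3
            # Close the final run at the last complete codon in this frame.
            end = frame + ((len(s) - frame) // 3) * 3
            if (end - run_start) // 3 >= min_codons:
                yield (strand, frame, run_start, end)

runs = list(stop_free_stretches("ATGAAATAGCCC"))
longest = max(runs, key=lambda r: r[3] - r[2])
print(longest)
```

Because a stop-free stretch need not begin with an ATG, the start of such a candidate is not by itself a TIS prediction, which is consistent with the limitation noted above.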
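The fixed-length window described above can be illustrated with a short Python sketch that extracts, for every ATG, the flanking bases on each side. This shows only the input framing, not the neural network; the function name `atg_windows` and the use of "N" padding where the sequence ends before the window does are assumptions, since the text does not say how NetStart treats sequence ends.

```python
def atg_windows(seq, flank=100, pad="N"):
    """Yield (atg_index, window) for every ATG in seq.

    Each window holds `flank` bases upstream, the ATG itself, and `flank`
    bases downstream, so the ATG always sits at window[flank:flank + 3].
    """
    seq = seq.upper()
    padded = pad * flank + seq + pad * flank
    start = seq.find("ATG")
    while start != -1:
        # Index `start` in seq is index `start + flank` in padded.
        yield start, padded[start:start + 2 * flank + 3]
        start = seq.find("ATG", start + 1)

for pos, window in atg_windows("CCATGGGATG", flank=3):
    print(pos, window)
```

For short ESTs a large part of each window may be padding, which is one way to see why a model trained on full-length mRNA windows may transfer imperfectly to EST data.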
Positional triplet weight matrix around the ATG; the propensity for a particular triplet to be in a specific position relative to the ATG.
Frequencies of in-frame hexanucleotides downstream of the ATG; favors longer reading frames with suitable hexanucleotide compositions.
Hexanucleotide difference before and after the ATG; these regions correspond to the putative 5' untranslated region (UTR) and the putative open reading frame, respectively; the difference between these 50-nucleotide regions should be greater for real start codons.
Likelihood of a signal peptide being present, based on the presence of hydrophobic 8-residue peptides within a 30 amino acid window downstream of the ATG.
Presence of another upstream in-frame ATG, which decreases the likelihood of the ATG under analysis being the true initiation codon according to the ribosome scanning model of translation initiation.
Upstream cytosine nucleotide presence; based on the observation that 5' untranslated regions of human genes are often rich in cytosine.
Each characteristic can distinguish true from false initiation sites. Reportedly, the most important features for correct predictions are the positional triplet weight matrix around the ATG and the hexanucleotide difference before and after the ATG. A linear discriminant function is used to combine the statistical measures of these six features into a final score. Like NetStart, ATGpr was trained on conceptually-spliced mRNA derived from genomic sequences with known start sites.
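Two of the simpler characteristics listed above, and the way a linear discriminant combines feature values into one score, can be sketched in Python as follows. This is an illustration of the general technique only: the function names, the 50-nucleotide upstream window for cytosine content, and the weights shown are hypothetical and are not ATGpr's trained parameters.

```python
def has_upstream_inframe_atg(seq, atg_pos):
    """True if another ATG lies upstream of atg_pos in the same reading frame."""
    seq = seq.upper()
    return any(seq[i:i + 3] == "ATG" for i in range(atg_pos % 3, atg_pos, 3))

def upstream_c_fraction(seq, atg_pos, window=50):
    """Fraction of cytosines in up to `window` bases upstream of the ATG."""
    upstream = seq.upper()[max(0, atg_pos - window):atg_pos]
    return upstream.count("C") / len(upstream) if upstream else 0.0

def linear_score(features, weights, bias=0.0):
    """Linear discriminant: a weighted sum of feature values plus a bias."""
    return bias + sum(weights[name] * value for name, value in features.items())

seq = "CCCCGCCACCATGGCTAAAATGCC"
pos = 10  # index of the first ATG in seq
features = {
    "upstream_inframe_atg": float(has_upstream_inframe_atg(seq, pos)),
    "upstream_c": upstream_c_fraction(seq, pos),
}
# Hypothetical weights: an upstream in-frame ATG lowers the score,
# a cytosine-rich upstream region raises it.
weights = {"upstream_inframe_atg": -1.0, "upstream_c": 2.0}
print(linear_score(features, weights))
```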
A major limitation of previous studies of methods for TIS prediction concerns the test datasets used. Several of the early computational methods for TIS and coding region prediction were evaluated before a large amount of EST data was available, and thus used instead mRNA or conceptually-spliced mRNA. Such datasets fail to capture the problems unique to EST data (described above). Furthermore, lack of consistency in data and types of data used for evaluating the different methods renders comparison problematic at best. Study of methods for TIS prediction would therefore benefit from a single dataset that is representative of the type of data seen in practical applications. This study benchmarks the key computational tools with a relevant dataset.
The five methods described above were applied to a dataset of 50 EST sequences with, and 50 without, translation initiation codons. In order to simulate the practical use of these methods in actual EST projects, only the top scoring ATG from each sequence is predicted to be the initiation codon, given that the corresponding score is above the threshold value under consideration.
Simply predicting whether or not EST sequences contain the TIS may be very useful for some EST projects. It can indicate which region of the gene is represented by the EST sequence as well as roughly assess the completeness of the EST's 5' end. Accordingly, this study evaluates the ability of ESTScan, Diogenes, NetStart, and ATGpr to predict the presence or absence of TIS.
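The decision rule just described, in which only the top-scoring ATG is reported and only if its score clears the threshold, can be written as a short Python sketch. The function name `predict_tis` and the input format are hypothetical, and treating a score exactly equal to the threshold as a positive call is an assumption.

```python
def predict_tis(scored_atgs, threshold):
    """Return the position of the top-scoring ATG, or None.

    scored_atgs is a list of (position, score) pairs for one sequence.
    None means the sequence is predicted to lack a TIS.
    """
    if not scored_atgs:
        return None
    position, score = max(scored_atgs, key=lambda ps: ps[1])
    return position if score >= threshold else None

candidates = [(34, 0.62), (121, 0.18), (232, 0.41)]
print(predict_tis(candidates, threshold=0.5))
print(predict_tis(candidates, threshold=0.7))
```

Under this rule each sequence yields exactly one outcome (a position or a rejection), which is what allows position accuracy and presence-versus-absence accuracy to be computed over the same 100 sequences.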
A. ROC curve values for each program. 95% CI of area: 0.776 to 0.924; 0.613 to 0.817; 0.605 to 0.806; 0.410 to 0.638.
B. ROC curve values for comparisons between two programs: NetStart v Diogenes; NetStart v ATGpr; Diogenes v ATGpr; ESTScan v ATGpr; NetStart v ESTScan; Diogenes v ESTScan.
The low accuracy of ESTScan points to the potential drawbacks of using this method for identifying TIS (or even their presence or absence) rather than for its more conventional use of detecting coding regions. However, the features considered by Diogenes were found to be sufficient to predict the presence or absence of start sites with moderate reliability.
Overall, ATGpr is shown to be effective in identifying TISs in EST sequences as well as in rejecting sequences that lack a true TIS. While an accuracy of 76% for prediction of true TISs leaves room for improvement, ATGpr achieves levels of sensitivity, specificity, and overall accuracy that are suitable for practical application. Furthermore, ATGpr's high accuracy in the dataset of sequences containing true TISs indicates that this method will become more useful as methods for generating 5'-complete ESTs improve.
Interestingly, ATGpr was found to be generally more effective than NetStart. Considering that ATGpr is based on a predetermined set of rules whereas NetStart utilizes artificial neural networks, ATGpr's favorable results indicate that improved understanding of the mechanism of translation initiation may lead to greater ability to identify translation initiation sites. Both programs might benefit from being retrained on newer, larger datasets, preferably consisting of ESTs instead of conceptually-spliced mRNA sequences.
The main aim of this study was to compare and contrast the performance of several algorithms in identifying TISs in EST data. However, combined analysis of the results from all of the algorithms yields additional information. For example, analysis of an EST corresponding to the human MSMB locus (GenBank ID BF679106) resulted in identical predictions by first-ATG, NetStart, and ATGpr (TIS at nucleotide position 34) that are consistent with annotation at ENSEMBL and GenBank, but in disagreement with annotation at ProtEST (TIS at position 232), the 'gold standard' used in this study.
In another example, an EST corresponding to the human RanBPM gene (GenBank ID AA311767) was one of the original 50 sequences with known TISs in the validation set. The consensus prediction by these same three algorithms (TIS at position 221) was in disagreement with annotation at ProtEST (TIS at position 336). The ProtEST annotation was consistent with experimental data that reported RanBPM was a 55 kDa protein. A more recent analysis revealed the molecular mass to be 90 kDa; the AA311767 EST corresponds to a 5'-truncated cDNA, and so contains no TIS. The RanBPM EST was replaced in the final validation set. However, as these examples demonstrate, neither computational nor molecular approaches are completely accurate in predicting the presence or location of TISs in ESTs.
Disagreement between an annotated TIS location and predictions corroborated by more than one algorithm can suggest problems with annotation, incomplete ESTs, and/or cDNA truncation. Differing results from multiple algorithms could also theoretically be used to identify alternative start codons and upstream open reading frames. Important future work would be to automate the extraction of such data from combined or serial analysis using multiple algorithms.
In principle, many coding sequences (and thus also TISs) can be characterized through alignment with homologous proteins. Several programs are capable of aligning nucleotide sequences to protein sequence databases; BLASTX and FASTX/FASTY are among the most popular. However, there are several major limitations to this approach. First, aligning a nucleotide sequence to protein sequences is more prone to error than other types of alignment due to the multiple reading frames and possible false hits from 5' or 3' untranslated regions. On the other hand, searching for matches in a nucleotide database does not guarantee that the matched sequences represent the complete gene. Perhaps most importantly, approaches based on alignment rely on homologous proteins and therefore cannot be used to find novel genes.
Methods based solely on the analysis of intrinsic properties of nucleotide sequences therefore seem to be the most promising – and perhaps the only useful – approaches for identifying TIS in EST data. Since living cells' translation machinery is able to identify start sites without using homology information, it should in principle be possible for computer programs to do the same.
Still, homology can be used to ease the task of identifying TISs. Homology searches can be used early in an EST project to determine which ESTs correspond to previously identified genes. Having thus narrowed down the dataset, the scientist can then focus on the remaining, novel genes.
Also, homology can be used to increase the accuracy of TIS prediction, particularly in borderline cases. Nishikawa et al. add similarity information to the ATGpr score, slightly improving sensitivity and specificity. The program, ATGpr_sim, was not available for evaluation in this study.
Despite the discovery of Kozak's consensus sequence and its apparently important role in translation initiation, it is not truly understood how this consensus modulates ribosomal scanning of mRNAs. Specifically, it is not clear why the ribosome pauses at ATG sites characterized by Kozak's consensus. A better understanding of the requirements for ribosome scanning and – more importantly – of the context in which the ribosome pauses to initiate translation could lead to more reliable methods for identifying translation initiation sites. For example, mRNA secondary structure immediately downstream of the initiator ATG has been shown to play a role in translation initiation. The context around initiation sites thus appears to be significantly more complex than current models. The superior performance of ATGpr in this project supports the notion of initiation sites being distinguished by a variety of features. ATGpr indeed bases its predictions on six types of sequence information. Even NetStart's neural networks apparently failed to capture the complexity of TISs.
Some of the complexities in translation initiation have recently been reviewed. The most relevant to this project is the occurrence of multicistronic eukaryotic mRNAs. Though the majority of eukaryotic mRNAs have one TIS, some have more. In some of the multicistronic sequences, intercistronic distances are small (80–150 nucleotides) and upstream ORF(s) are short (< 30 nucleotides). Thus even short ESTs may have more than one valid TIS. It is important to develop methods to identify multiple TISs when they occur in ESTs, and to distinguish between TISs that initiate translation of short polypeptides and those that initiate much longer proteins.
Another area that deserves more attention is the 5' untranslated region (UTR). As described above, the ribosome binds the mRNA at the 5' cap region at the 5' terminus. This means that the entire length of the 5' UTR is passed over by the ribosome before it initiates protein synthesis. More detailed knowledge of the 5' UTR might provide insight as to why this region is passed over by the ribosome, possibly even clarifying why some first-ATGs are not true TISs.
Analysis and annotation of EST data would of course benefit from higher quality EST sequences, or even higher quality reference cDNA sequences. Oligo-capping allows collection of full-length cDNA sequences by recognizing the cap structure and introducing an oligomer RNA at the 5' end of the mRNA. Comparison of ESTs to homologous sequences in oligo-capped cDNA libraries could vastly improve determination of the 5'-completeness of ESTs and thus improve EST analysis and annotation.
Gene identification is one of the major tasks of bioinformatics. As high throughput methods have facilitated complete genome sequencing, the importance of identifying coding regions has become more evident. Analyzing sequences from cDNAs is the most direct way to identify and characterize the coding regions. The structural annotation of genes in genomic sequences will therefore likely depend on cDNA analysis until/unless more efficient methods are developed. Accordingly, the number of novel cDNA and EST sequences is growing quite rapidly. Yet relatively few programs can reliably determine the completeness of EST sequences.
However, there has been recent rapid progress in the development of new methods for determining the 5'-completeness of EST sequences by identifying TISs. This project assesses the problem of EST analysis in the broader context of genomics and gene discovery, reviews the key concepts and relatively new methods for identifying translation initiation sites, and compares the performance of these methods. Our analysis has confirmed that although detection of the presence and extent of an open reading frame is valuable, further information is required to accurately predict TISs in EST data. ESTScan and Diogenes did well in predicting that a sequence contains a CDS, the purpose for which these programs were developed. This capability is distinct from that required to identify start codons, as revealed by their poor performance in identifying the presence and position of TISs in the test set.
This study has found one method, ATGpr, to be successful in identifying TISs. ATGpr demonstrated relatively high sensitivity, specificity, and overall accuracy in identifying start sites while also rejecting incomplete sequences. The inclusion of information on similarity to known protein sequences in later versions of ATGpr indicates that this method can provide more reliable information for annotating EST and cDNA sequences. Furthermore, advanced methods for generating ESTs, such as oligo-capping, which lead to full-length cDNAs, will improve EST databases, ultimately resulting in more reliable analysis and annotation of novel genes.
UniGene is a system of GenBank sequences partitioned into non-redundant gene clusters. It contains the sequences of well-characterized genes as well as hundreds of thousands of EST sequences. UniGene Build #160: Homo sapiens was the initial sequence source. On February 16, 2003, it contained 111,064 clusters and 4,020,822 EST sequences.
Human genomic sequences containing the annotation "complete CDS" (complete coding sequence) were selected as starting points for gene selection. A filter for RefSeq (NCBI Reference Sequence) entries was applied to ensure that the dataset was non-redundant, up-to-date, and composed of valid entries. The resultant 2074 sequences were filtered to include only those with links to UniGene clusters. Of these 371 UniGene clusters, 50 clusters containing ProtEST links were randomly selected. ProtEST provides protein matches for ESTs, and ensures that the matches exclude conceptual translations by using sequences only from Swissprot, PIR, PDB, and PRF. Finally, from this strict set of UniGene clusters, two 5'-EST sequences were selected randomly from each cluster: one containing the TIS and one lacking the TIS, confirmed by visual alignment with the reference sequence. The type of ESTs generated (5' versus 3') depends on the directionality of the primers used in vitro. A total of 100 EST sequences were used: 50 containing and 50 lacking the TIS.
The EST sequences were entered into the five programs: first-ATG, ESTScan, Diogenes, NetStart, and ATGpr. All of these methods except for first-ATG were accessed via their web sites (see Table 1). First-ATG was performed through Microsoft Excel spreadsheet functions. Performance of each method was measured in terms of sensitivity and specificity of EST 5'-completeness predictions (in other words, presence versus absence of the TIS), and of percentage accuracy of predicting the position of the TIS or lack thereof. With the exception of the first-ATG method, all of the methods report a score along with the prediction. This permits users to employ custom thresholds. The statistical measures described above were calculated across all threshold scores. Statistical analyses were performed using Analyse-it statistical software for Microsoft Excel.
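The threshold sweep described above can be illustrated with a short Python sketch that computes sensitivity and specificity at each observed score and integrates the resulting ROC curve by the trapezoidal rule. This is not the procedure used by the statistical software named above, which may estimate the area and its confidence interval differently; the function names and the toy scores are hypothetical.

```python
def sensitivity_specificity(scores, has_tis, threshold):
    """Presence-versus-absence calls at one threshold (score >= threshold is a positive call)."""
    tp = fn = tn = fp = 0
    for score, truth in zip(scores, has_tis):
        predicted = score >= threshold
        if truth and predicted:
            tp += 1
        elif truth:
            fn += 1
        elif predicted:
            fp += 1
        else:
            tn += 1
    sens = tp / (tp + fn) if tp + fn else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0
    return sens, spec

def roc_points(scores, has_tis):
    """(1 - specificity, sensitivity) at every distinct observed score."""
    points = []
    for t in sorted(set(scores), reverse=True):
        sens, spec = sensitivity_specificity(scores, has_tis, t)
        points.append((1 - spec, sens))
    return points

def roc_area(points):
    """Trapezoidal area under the ROC curve, anchored at (0, 0) and (1, 1)."""
    pts = sorted([(0.0, 0.0)] + points + [(1.0, 1.0)])
    return sum((x2 - x1) * (y1 + y2) / 2 for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

# Toy data: one top score per EST and whether that EST truly contains a TIS.
scores = [0.9, 0.8, 0.3, 0.1]
has_tis = [True, True, False, False]
print(sensitivity_specificity(scores, has_tis, 0.5))
print(roc_area(roc_points(scores, has_tis)))
```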
We thank Eric Klee and Stephen Ekker for helpful discussions. This work was supported in part by NIH R01-GM63904.
This article is published under license to BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.