Ornithine decarboxylase antizyme finder (OAF): Fast and reliable detection of antizymes with frameshifts in mRNAs

Background: Ornithine decarboxylase antizymes are proteins that negatively regulate cellular polyamine levels via their effects on polyamine synthesis and cellular uptake. In virtually all organisms from yeast to mammals, antizymes are encoded by two partially overlapping open reading frames (ORFs). A +1 frameshift between frames is required for the synthesis of antizyme. Ribosomes change translation phase at the end of the first ORF in response to stimulatory signals embedded in mRNA. Since standard sequence analysis pipelines are currently unable to recognise sites of programmed ribosomal frameshifting, proper detection of full-length antizyme coding sequences (CDS) requires conscientious manual evaluation by a human expert. The rapid growth of sequence information demands less laborious and more cost-efficient solutions for this problem. This manuscript describes a rapid and accurate computer tool for antizyme CDS detection that requires minimal human involvement.

Results: We have developed a computer tool, OAF (ODC antizyme finder), for identifying antizyme-encoding sequences in spliced or intronless nucleic acid sequences. OAF utilizes a combination of profile hidden Markov models (HMMs) built separately for the products of each open reading frame constituting the entire antizyme coding sequence. The profile HMMs are based on a set of 218 manually assembled antizyme sequences. To distinguish between antizyme paralogs and orthologs from major phyla, antizyme sequences were clustered into twelve groups and specific combinations of profile HMMs were designed for each group. OAF has been tested on the current version of dbEST, where it identified over six thousand Expressed Sequence Tag (EST) sequences encoding antizyme proteins (over two thousand antizyme CDS in these ESTs are non-redundant).

Conclusion: OAF performs well on raw EST sequences and mRNA sequences derived from genomic annotations. OAF will be used for the future updates of the RECODE database. OAF can also be useful for identifying novel antizyme sequences when run with relaxed parameters. It is anticipated that OAF will be used for EST and genome annotation purposes. OAF outputs sequence annotations in FASTA, GenBank flat file or XML format. The OAF web interface and the source code are freely available at and at a mirror site .


Background
Ornithine decarboxylase antizymes are important negative regulators of cellular polyamine levels. In mammals, antizyme-1 inhibits ornithine decarboxylase (ODC), an enzyme catalyzing the first and rate-limiting step in polyamine biosynthesis. Antizyme-1 binds to ODC and targets it for ubiquitin-independent degradation by the 26S proteasome in a multiple-turnover manner (a single antizyme molecule can cause degradation of several ODC molecules) [1,2]. Additionally, antizyme-1 regulates the intracellular concentration of polyamines by inhibiting cellular import of polyamines and accelerating polyamine export from the cell [3][4][5]. While genomes of lower eukaryotes contain single antizyme genes, multiple paralogs have evolved in higher eukaryotes, with at least two antizymes in vertebrates [6,7], three in mammals [8,9] and up to five in certain fish species [10]. Antizyme paralogs vary somewhat in their function, although all are implicated in the regulation of polyamine synthesis (and some are reported to link with other pathways [11,12]). Antizyme paralogs usually have distinct expression patterns, with certain paralogs being expressed in a strictly tissue-specific manner, such as the testis-specific mammalian antizyme 3 [8,9] or the retina- and brain-specific antizyme AZR from Danio rerio [13]. Reviews of antizyme function and distribution are available [10,14,15].
Given the important role that antizymes play in the regulation of polyamine concentrations, it is not surprising that their own biosynthesis is regulated in response to changes in cellular polyamine concentrations. Polyamine concentrations are sensed during the elongation stage of antizyme mRNA translation. Unlike the great majority of CDSs, the CDS of virtually all eukaryotic antizymes consists of two overlapping open reading frames. Synthesis of full-length antizyme protein requires a portion of translating ribosomes to switch translation phase at the end of the first ORF into the partially overlapping ORF (in the +1 translation phase), a process termed programmed ribosomal frameshifting [16]. The ribosomes that do not shift frames terminate at the end of the first ORF and release a relatively short polypeptide. Increases in cellular polyamine levels raise frameshifting efficiency and hence the synthesis of fully functional antizyme. The competition between frameshifting and termination at the end of the first ORF thus acts as a sensor of polyamine concentration and provides an elegant mechanism for negative regulatory feedback (Figure 1A).
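The two translation outcomes described above can be illustrated with a short Python sketch. It is not part of OAF. The function names and the toy sequence are hypothetical, the standard genetic code is assumed, and the shift is modelled, for illustration only, as skipping the first nucleotide of the ORF1 stop codon.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), DNA alphabet, TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(seq, start):
    """Translate seq from index start until the first in-frame stop codon.

    Returns (protein, stop_index); stop_index is None if no stop is reached.
    """
    protein = []
    for i in range(start, len(seq) - 2, 3):
        aa = CODON_TABLE[seq[i:i + 3]]
        if aa == "*":
            return "".join(protein), i
        protein.append(aa)
    return "".join(protein), None


def antizyme_products(mrna, start):
    """Return (termination_product, frameshift_product) for a toy mRNA.

    Assumption (hypothetical model): ribosomes that shift skip the first
    nucleotide of the ORF1 stop codon and continue in the +1 phase.
    """
    short, stop1 = translate(mrna, start)
    if stop1 is None:
        return short, None
    tail, _ = translate(mrna, stop1 + 1)  # resume one nucleotide downstream
    return short, short + tail


# Toy sequence: a short ORF1 ending in TGA, followed by a longer +1 ORF.
toy = "ATG" + "GCC" * 3 + "TGA" + "C" + "GCT" * 10 + "TAA"
short_product, full_product = antizyme_products(toy, 0)
```

In this toy case the termination product is four residues long, whereas the frameshift product continues through the downstream ORF. Which outcome dominates in a cell depends on polyamine levels, which the sketch does not model.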
The +1 frameshifting event during antizyme biosynthesis significantly complicates automatic detection of its full-length CDS in mRNA. This is due to the lack of reliable and efficient algorithms for predicting ribosomal frameshifting locations. A number of attempts have been made recently to develop computational approaches for predicting instances of ribosomal frameshifting [17][18][19][20][21][22]. Some of these approaches could be useful for detecting candidate sequences that are prone to efficient (not necessarily programmed) frameshifting within particular groups of organisms [17][18][19][23]. However, they are not suitable for reliable detection of programmed ribosomal frameshifting events without experimental verification or additional expert human involvement. The reasons underlying the consistent failure to develop highly accurate algorithms for ribosomal frameshifting prediction lie in the very nature of programmed ribosomal frameshifting. The efficiency of ribosomal frameshifting is modulated by highly diverse sequence elements, many of which evolved independently. The mechanisms by which such elements alter translation also vary considerably. The situation is further complicated by differences in the translation machinery (sequences of ribosomal components, differences in tRNA properties and their relative concentrations) across organisms, so that the same sequence can be shift-prone in one organism but accurately translated in the standard triplet manner in another. Therefore, it is not possible to find even a single nucleotide sequence feature that would specify a site of ribosomal frameshifting in all organisms.
Information regarding the diversity of genes utilizing programmed ribosomal frameshifting for their expression, as well as the multifarious sequences modulating the frameshifting process, is available in the Recode database, which is currently the richest Internet resource on the subject [24,25], and in comprehensive literature reviews on this and related topics [26][27][28][29][30][31][32][33][34][35]. In fact, antizyme mRNAs themselves are currently the most plentiful source of diverse frameshift stimulatory signals, as is evident from the recent detailed review covering nearly three hundred antizyme mRNA sequences [10]. The collection of sequences described in that review was used here for the design of OAF (Additional file 1).
It appears that approaches that predict frameshifting specifically for particular clusters of related genes produce more reliable results. Such approaches have been applied to -1 frameshifting involved in the synthesis of viral polyproteins [21], to different types of frameshifting events in decoding bacteriophage tail assembly genes [20], and to +1 frameshifting during the synthesis of bacterial release factor 2 [22]. Indeed, ribosomal frameshifting utilized by a group of homologous genes likely has the same origin. While evolution introduces organism-specific alterations in the sequence of the frameshifting cassette, as well as diversifying the protein sequence, a detectable degree of similarity is frequently recognizable. Though the existence of such similarity may not be a universal rule (as is evident from the frameshifting utilized in decoding bacteriophage tail assembly genes [20], where only the genomic localization of overlapping ORFs is conserved), it holds true in many cases. Therefore, knowledge of a few examples of ribosomal frameshifting from homologous genes can be sufficient for designing algorithms for automatic and accurate prediction of ribosomal frameshifting utilized in decoding of homologous genes. By dealing with each group of homologous genes utilizing ribosomal frameshifting separately, one by one, we aim to build a collection of autonomous computer tools capable of automatically predicting most cases of ribosomal frameshifting in newly sequenced organisms. OAF is our second computer tool designed in pursuit of this goal. Our first tool, ARFA, detects and annotates the programmed ribosomal frameshifting required for expression of certain bacterial release factors [22]. Both tools will be used for future updates of the Recode database.

Implementation
OAF is written in Perl and utilizes BioPerl libraries [36]. The OAF web interface was designed using PHP.

Outline of the analysis performed by OAF
Antizyme mRNAs from different organisms have evolved a remarkable assortment of RNA signals for stimulating or modulating the +1 ribosomal frameshifting used in their expression. Many sequence features are shared among closely related antizyme mRNAs. For example, two distinct types of frameshift-enhancing RNA pseudoknots are embedded in antizyme-1 and antizyme-2 mRNAs from mammals. Nevertheless, not a single feature is universally conserved. Instead of trying to account for known frameshifting stimulators, we have devised an antizyme gene detection scheme based on detection of sequences encoding antizymes. While antizyme protein sequences are highly diverse, there is a reasonable degree of sequence similarity within large phylogenetic groups allowing their detection based on similarity searches. Most importantly, eukaryotic antizyme genes share the same ORF organisation: the upstream ORF is smaller than the downstream ORF and the downstream ORF is always in the +1 translational phase relative to the first one. Therefore our method is based on a search for two overlapping ORFs corresponding to profile HMMs designed using sequences of known antizymes. Mutual orientation of the ORFs is further examined to verify that it corresponds to an expected transition between translational phases. For large sequences (>20 kb), OAF performs an initial FASTA search with relaxed parameters, where a mixture of divergent antizyme sequences is used as a query. This is used to increase OAF speed by reducing the number of candidate sequences for subsequent HMM analysis. Relaxed parameters decrease the chances of losing true positives in this process. The scheme of analyses performed by OAF is illustrated in Figure 2.

Figure 1. Scheme of negative regulatory feedback in regulation of antizyme synthesis and conservation of the frameshift site.

Figure 2. Scheme of analyses performed by OAF. OAF pipeline. Each step performed by OAF is shown as a box; grey boxes represent external modules utilized by OAF.
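The ORF geometry that the method relies on (a shorter upstream ORF followed by a longer downstream ORF in the +1 phase) can be sketched in Python as follows. This is an illustrative sketch only, not the OAF implementation: it omits the profile HMM and FASTA steps entirely, the function names are hypothetical, and it assumes standard stop codons, an ATG start, and a downstream reading frame that begins one nucleotide into the ORF1 stop codon.

```python
STOPS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code stop codons


def next_stop(seq, pos):
    """Index of the first stop codon in the frame of pos, at or after pos."""
    for i in range(pos, len(seq) - 2, 3):
        if seq[i:i + 3] in STOPS:
            return i
    return None


def plus_one_candidates(seq):
    """Yield (orf1_start, orf1_stop, orf2_stop) for antizyme-like ORF pairs.

    A pair is reported when an ATG-initiated ORF1 is followed by an open
    stretch in the +1 phase that is longer than ORF1 itself. Only the part
    of the downstream ORF after the presumed shift is measured, which is a
    simplification.
    """
    for start in range(len(seq) - 2):
        if seq[start:start + 3] != "ATG":
            continue
        stop1 = next_stop(seq, start + 3)
        if stop1 is None:
            continue
        resume = stop1 + 1  # +1 phase relative to ORF1
        stop2 = next_stop(seq, resume)
        if stop2 is None:
            continue
        if stop2 - resume > stop1 - start:  # downstream part longer than ORF1
            yield (start, stop1, stop2)


toy = "ATG" + "GCC" * 3 + "TGA" + "C" + "GCT" * 10 + "TAA"
candidates = list(plus_one_candidates(toy))
```

A purely geometric filter of this kind would accept many unrelated sequences, which is why the scheme described above additionally requires both ORF products to match profile HMMs built from known antizymes.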

Profile HMMs and automatic classification of antizymes
To design the profile HMMs exploited by OAF, we used a collection of protein sequences derived from mRNA fragments using manually assembled ESTs. These sequences were described in some detail in a recent antizyme review [10] and are available with this article as Additional file 1 (manualOAZs.fasta). Evolutionary distances between protein sequences were estimated using the Neighbour-Joining algorithm and the Poisson correction evolutionary model implemented in the MEGA3.1 program [37]. Based on these distances, sequences were clustered into 12 homologous groups, for which separate pairs of profile HMMs were designed using HMMER [38]. These HMMs are used to allow discrimination among different antizyme paralogs and to permit approximate estimation of the taxonomic origins of antizyme-encoding sequences. The clustering is shown on the tree generated with MEGA3.1 (see Figure 3).
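For readers unfamiliar with the Poisson correction mentioned above, the distance between two aligned protein sequences is d = -ln(1 - p), where p is the proportion of aligned sites at which the residues differ. The Python sketch below computes it for a pair of pre-aligned sequences. The function names are hypothetical, gaps are handled by pairwise deletion as an assumption, and MEGA3.1 itself was not consulted for implementation details.

```python
import math


def p_distance(a, b, gap="-"):
    """Proportion of differing residues over sites where neither has a gap."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)


def poisson_distance(a, b):
    """Poisson-corrected distance d = -ln(1 - p) between aligned proteins."""
    p = p_distance(a, b)
    if p >= 1.0:
        return math.inf  # fully saturated: the correction is undefined
    return -math.log(1.0 - p)


# Two toy aligned fragments differing at one of four comparable sites.
d = poisson_distance("MKTA-", "MKTGL")
```

A matrix of such pairwise distances is the kind of input that a Neighbour-Joining tree is built from.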
A separate profile HMM is built for the frameshift site itself. This HMM is not used for identification of antizymes or frameshift sites. However, a predicted frameshift site is compared to the HMM and the corresponding E-score can be reported in the output to facilitate further processing of data, such as identification of unusual frameshift sites or detection of sequencing errors disguised as cryptic frameshift sites. Figure 1B illustrates conservation of OAZ frameshift sites as a web logo [39].

Figure 3. OAZ clustering. A circular tree of OAZ sequences representing the clustering that has been used to design the profile HMMs used by OAF.

OAF I/O interface

Input
There are two types of searches that can be performed by OAF. First, a given nucleotide sequence or multiple sequences (provided either in a user's file in FASTA format or as a GenBank accession number) can be analyzed for the presence of antizyme CDS (first two modes in Figure 2). Second (third mode in Figure 2), protein sequences of known antizymes in a user's FASTA file can be used as a query for a search against a database of nucleotide sequences (either a local BLAST database or a remote BLAST database at NCBI). A user can specify the genetic code table and the use of alternative initiation codons (by default, a CDS can start only with ATG/AUG).

Output
OAF reports sequences of encoded antizymes either as raw sequence or in FASTA, GenBank or XML format. The XML output contains detailed information regarding the frameshift site and is compatible with a future version of the Recode database. By default, OAF reports all sequences encoding antizymes, even if their ORF organization does not correspond to that required for +1 frameshifting or if only a partial antizyme sequence is found. Such likely erroneous sequences can be filtered out automatically.

Web interface
The web interface of OAF (see Availability and Requirements section) serves mostly illustrative purposes and has limited capabilities compared to the full version of OAF. The web service allows analysis of a single user-provided sequence for the presence of an antizyme-encoding sequence.

Results and Discussion
To evaluate OAF prediction sensitivity for genome annotations, the mRNA sequences of 20 completed eukaryotic genomes were downloaded from the RefSeq database [40]. OAF detected 18 OAZ genes (Table 1). No genes encoding antizymes were detected in plant genomes (Table 1). To evaluate OAF prediction selectivity, a random sequence database (totalling 10 Tbp) was generated using fifth-order Markov chains based on the six-mer frequencies of each mRNA of the genomic sequences. OAF did not detect any OAZ sequence in this database.
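The construction of such a random control can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the script used in this study: the function name is hypothetical, the chain is seeded with the first five nucleotides of the template, and a context with no observed successor causes a restart from that seed.

```python
import random
from collections import Counter, defaultdict


def random_like(template, length, order=5, rng=None):
    """Generate a random sequence from an order-k Markov chain of template.

    Transition counts are taken from the (order + 1)-mers of the template,
    i.e. six-mers for a fifth-order chain.
    """
    if len(template) <= order:
        raise ValueError("template must be longer than the Markov order")
    rng = rng or random.Random()

    # counts[context][base] = number of times base follows context
    counts = defaultdict(Counter)
    for i in range(len(template) - order):
        counts[template[i:i + order]][template[i + order]] += 1

    state = template[:order]  # assumption: seed with the template's start
    out = []
    while len(out) < length:
        options = counts.get(state)
        if not options:
            # Dead end (context seen only at the template's 3' end): restart.
            state = template[:order]
            continue
        bases, weights = zip(*options.items())
        base = rng.choices(bases, weights=weights)[0]
        out.append(base)
        state = state[1:] + base
    return "".join(out)


# Reproducible toy example with a fixed seed.
sample = random_like("ATGGCCGCCTGACGCTGCTGCTTAA" * 4, 60, rng=random.Random(0))
```

Sequences generated this way preserve the local oligonucleotide composition of each template mRNA while destroying any long-range coding signal, so hits against them can be counted as false positives.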
To estimate OAF accuracy on EST sequences, the June 2007 release of dbEST was used [41]. OAF detected antizyme sequences in 6639 ESTs, among which there are 2067 unique antizyme-coding sequences. Many of these sequences are truncated mRNA fragments that can be grouped as corresponding to the same antizyme mRNA. Twenty-four new antizyme sequences that were not present in the original dataset (Additional file 1) were detected (see Table 2).
OAF has detected a number of highly similar variant OAZ sequences supported by multiple ESTs corresponding to the same species. Some of these sequences are most likely allelic variants, while others correspond to recent gene duplication events. OAZ variants are summarized in Table 3.
OAF detected a number of sequences whose OAZ clustering (Figure 3) did not match the taxonomy of the source organisms. These sequences are likely contaminants that were introduced from pests, symbionts, food or cell hosts (see Table 4). Some of these contaminations were reported previously [10].

Conclusion
We have developed a simple computer utility for identification of OAZ encoding sequences in nucleic acids, called OAF (ODC antizyme finder). It performs with high speed and accuracy on mRNA sequences annotated in completed genomes as well as on raw RNA sequences from EST collections.

Competing interests
The authors declare that they have no competing interests.

Authors' contributions
MB designed and scripted OAF and its web interface. IPI manually reconstructed antizyme mRNA sequences from EST collections. JFA provided encouragement, general coordination and financial support to the project. PVB conceived the project, helped to design OAF and wrote the manuscript. All authors have contributed to the final revision of the manuscript.

Table 3. Summary of OAZ sequence variants. Names of organisms are shown in the first column. The second column shows OAZ classification according to OAF clustering. The third column lists accession numbers for representative ESTs. The fourth column shows the number of EST sequences containing a complete OAZ CDS. The fifth column shows the number of ESTs containing a partial OAZ CDS. Likely reasons for the existence of OAZ variants are given in the last column. Asterisks indicate that no complete genome sequence is available for the corresponding organism, meaning that our assessment of variant origin is presumptive.

Table 4. OAZ sequences from contaminating organisms. Names of organisms where contaminants were found are given in the first column. The second column gives the corresponding taxonomic family. The third column shows the OAZ cluster to which an OAZ is expected to belong based on the taxonomy of the source organism. The fourth column shows the cluster to which the sequence actually belongs. The fifth column lists the likely source of contamination. The sixth column contains EST accession numbers.