Detecting overlapping coding sequences in virus genomes

Background Detecting new coding sequences (CDSs) in viral genomes can be difficult for several reasons. The typically compact genomes often contain a number of overlapping coding and non-coding functional elements, which can result in unusual patterns of codon usage; conservation between related sequences can be difficult to interpret – especially within overlapping genes; and viruses often employ non-canonical translational mechanisms – e.g. frameshifting, stop codon read-through, leaky-scanning and internal ribosome entry sites – which can conceal potentially coding open reading frames (ORFs). Results In a previous paper we introduced a new statistic – MLOGD (Maximum Likelihood Overlapping Gene Detector) – for detecting and analysing overlapping CDSs. Here we present (a) an improved MLOGD statistic, (b) a greatly extended suite of software using MLOGD, (c) a database of results for 640 virus sequence alignments, and (d) a web-interface to the software and database. Tests show that, from an alignment with just 20 mutations, MLOGD can discriminate non-overlapping CDSs from non-coding ORFs with a typical accuracy of up to 98%, and can detect CDSs overlapping known CDSs with a typical accuracy of 90%. In addition, the software produces a variety of statistics and graphics, useful for analysing an input multiple sequence alignment. Conclusion MLOGD is an easy-to-use tool for virus genome annotation, detecting new CDSs – in particular overlapping or short CDSs – and for analysing overlapping CDSs following frameshift sites. The software, web-server, database and supplementary material are available at .


Background
Methods for finding protein-coding sequences (CDSs) in prokaryotes and eukaryotes are well-developed. Algorithms generally make use of combinations of the following signatures of CDSs: (a) codon or dicodon bias etc., (b) conservation between species, (c) similarity to known sequences, (d) presence of open reading frames (ORFs), splice sites etc., and (e) expression in cDNA/EST libraries [1].
In virus genomes, however, the situation can be complicated by a number of factors, that may lead to decreased sensitivity: (a) virus genomes are often too small (e.g. < 10 kb) to obtain codon usage statistics and, in any case, the compact genomes often contain overlapping coding and non-coding functional elements that can result in unusual codon usage patterns; (b) regions of high conservation between related sequences may not necessarily be coding and, where CDSs and/or non-coding functional elements overlap, conservation may only reveal the presence of one of the overlapping pair; (c) new virus types often contain novel CDSs, dissimilar to previously annotated CDS; and (d) viruses can employ a variety of non-canonical translational mechanisms -e.g. frameshifting, stop codon readthrough, leaky-scanning and internal ribosome entry sites.
Comparative genomics is a particularly useful way to detect new CDSs in virus genomes, because many sequenced virus genomes, covering a useful range of diversity (i.e. sequence divergence), are available. In its simplest form, a comparative genomics approach consists of looking for genome regions that are more conserved than average between related sequences. Such an approach may fail to distinguish CDSs from other conserved elements. A more advanced approach is to look for the particular mutation patterns associated with CDSse.g. the CRITICA software [2], or pair hidden Markov models [3]. However, previous such algorithms have not dealt adequately with the case of overlapping CDSs.
In a previous paper [4] we introduced a probabilistic model for the mutation patterns associated with non-coding, single-coding and double-coding regions of a multiple sequence alignment, and a maximum likelihood statistic -called MLOGD -for predicting whether a new query ORF is coding or non-coding. Here we present (a) an improved MLOGD statistic, (b) a greatly extended suite of software using MLOGD (70% is new relative to [4], the rest has been substantially revised), (c) a database of results in virus genomes, and (d) a web-interface to the software and database.

Implementation
The MLOGD statistic Given an input sequence alignment, a null model of the CDS annotation (i.e. the known CDSs in some chosen reference sequence) and an alternate model (i.e. the known CDSs plus a new putative CDS), the MLOGD statistic is an estimate of the relative probabilities of obtaining the observed pattern of mutations across the alignment under each of the null and alternate models. In this subsection we first describe how the MLOGD statistic is calculated for a pairwise sequence alignment. Then we describe how this is extended to a multiple sequence alignment. More extensive notes are given on the website.

MLOGD statistic for a two-sequence alignment
Given two aligned sequences S 1 and S 2 , we estimate the probability that S 1 mutates to S 2 , after time t, by where and are the nucleotides in S 1 and S 2 at the kth alignment position (see website for treatment of alignment gaps); P( → ; t, m k ) is calculated using nucleotide, codon and amino acid substitution matrices, as described in [4] (with the obvious extension for the noncoding model); M is the null or alternate model; and m k is the coding status (non-coding, single-coding or doublecoding) at the kth alignment position, according to the relevant model M (defined on the chosen reference sequence).
We define the pairwise sequence divergence, Λ, to be the total number of point nucleotide differences between S 1 and S 2 , and we determine t numerically for each of the null and alternate models by requiring the expected number of mutations between S 1 and S 2 , under the model, to equal the observed number of mutations, Λ.
The log likelihood ratio of the two models is If log(LR) is positive, then the observed mutations between S 1 and S 2 are more consistent with the alternate model. If log(LR) is negative, then the observed mutations are more consistent with the null model.

MLOGD statistic for a multiple sequence alignment
The MLOGD statistic, ∑ tree log(LR), is calculated for a multiple sequence alignment using the following procedure. First a phylogenetic tree is constructed using standard software (e.g. PHYLIP, [5]). The (unrooted) tree is used to select a list of sequence pairs tracing round the outside of the tree (figure on website). Such a set of pairwise comparisons covers each branch of the tree precisely twice. The MLOGD statistic, log(LR), is calculated for each of the sequence pairs in the list, summed up over all the pairs, and divided by two, to give the MLOGD statistic for the multiple sequence alignment, ∑ tree log (LR). Similarly, the total number of mutations, ∑ tree Λ across the phylogenetic tree is the sum of Λ values for each sequence pair, divided by two.

Input data
The required input data for MLOGD are a multiple sequence alignment of related sequences, a list of known CDSs (possibly none) in a chosen reference sequence, and a phylogenetic tree. Circular genomes are fully supported. For viruses, useful sets of related sequences may be obtained from the NCBI Viral Genomes Project website [6]. There are tools on the web-server to help produce a suitable alignment and phylogenetic tree. ).

( )
Nucleotide-by-nucleotide plot  Six-frame sliding window plot Figure 2 Six-frame sliding window plot. Example output plot for the 'Six-frame sliding window plots' option (same sequences as in Figure 1). This is a plot of the ∑ tree log (LR) statistic calculated in a sliding window along the alignment in each of the six possible read-frames. In each window, the null model is that 'only the known CDS(s) are coding' while the alternate model is that 'both the window and the known CDS

Operation modes
The MLOGD software has three operation modes, described below. The 'Test input query CDSs' option can be used to test a specific query CDS (e.g. an ORF that has not previously been annotated as a CDS). The 'Find and test all non-annotated ORFs' and 'Six-frame sliding window plots' options can be used to search a whole input alignment for new CDSs.

Test input query CDSs
Here MLOGD calculates the null versus alternate model likelihood ratio statistics, where the null model is that the query ORF is non-coding, while the alternate model is that the query ORF is coding (both the null and alternate models include all the annotated CDSs).
The output includes: 3. A nucleotide-by-nucleotide plot of the log(LR) statistic for each reference -non-reference sequence pair, the sum over the phylogenetic tree, and running means of the same (Figure 1). On the web-server there are links to zoom in on the plot and/or adjust the running-mean window size.

Find and test all non-annotated ORFs
The 'Find and test all non-annotated ORFs' option finds all non-annotated ORFs longer than a specified minimum length, and produces the same statistics and plots as the 'Test input query CDSs' option for each of these ORFs. The user may select 'start-stop' ORFs or 'stop-stop' ORFs.

Six-frame sliding window plots
Sometimes a CDS may be entered through a stop codon read-through, ribosomal frameshift, or splicing event rather than with an AUG codon. The 'Find and test all non-annotated ORFs' option (above) may not be the best way to locate such CDSs, since such CDSs generally commence part-way through an ORF. The 'Six-frame sliding window plots' option calculates the ∑ tree log(LR) statistic in a window sliding along the alignment in all six reading frames ( Figure 2). The user may select the window and step sizes. Extended regions of positive signal may indi-cate potential new CDSs, especially where there is an absence of stop codons across the alignment.

Sensitivity and selectivity
The software has been previously tested on simulated data and on overlapping CDSs in the Hepatitis B Virus and Escherichia coli genomes [4]. Further tests showed that, for alignments with just 20 mutations overall (i.e. ∑ tree Λ = 20; e.g. a pairwise comparison of two 100 nt sequences with a mean divergence of 0.2 mutations per nt), MLOGD can discriminate nonoverlapping CDSs from non-coding ORFs with a typical accuracy of up to 98%, and can detect CDSs overlapping known CDSs with a typical accuracy of 90% (see website for details). In general usage, ∑ tree Λ is often much greater than 20, with correspondingly lower predicted error rates.

On-line virus database
A database of results for 640 virus sequence alignments is available on the website. The database contains multiple sequence alignments, phylogenetic trees, positions of known CDSs, six-frame sliding window plots, statistics and plots for the annotated CDSs, and statistics and plots for all non-annotated start-stop ORFs in the reference sequences of at least 40 codons in length.

Conclusion
We have presented (a) a new tool for locating and analysing CDSs in virus alignments, and (b) an on-line database of results in 640 virus alignments. Besides the easy-to-use website and comprehensive output, the main advantage of MLOGD over other gene-finding software is that MLOGD explicitly takes into account the possibility of overlapping genes -common in viruses. For example, for the Hepatitis B, Avian Hepatitis B, Polerovirus, Luteovirus and Human Immunodeficiency Virus 1 genomes (compact genomes with relatively high fractions of overlapping CDSs), MLOGD successfully finds all 28 known CDSs, while GeneMark only finds 17 (VIOLIN database, [7]). We have extensively tested the sensitivity of MLOGD and shown it to be more sensitive than other methods for detecting overlapping CDSs [4].
MLOGD can, of course, also be used for cellular organisms. Partially overlapping CDSs are fairly common in prokaryotes, but it is not clear what fraction of overlaps are functionally constrained. Many appear to be the result of the loss of a stop codon, allowing one CDS to run into an adjacent CDS [8]. Others may be involved in regulatory mechanisms [9]. Similarly, many potential ribosomal frameshift sites -leading to overlapping CDSs -have been identified in cellular organisms [10], as well as viral genomes. MLOGD is a valuable tool for analysing the magnitude of functional constraints on such overlaps, with implications for the annotation of putative frameshift sites, and the evolution of overlapping genes in viruses and in prokaryotes.