Detecting overlapping coding sequences in virus genomes
© Firth and Brown, licensee BioMed Central Ltd. 2006
Received: 9 September 2005
Accepted: 16 February 2006
Published: 16 February 2006
Detecting new coding sequences (CDSs) in viral genomes can be difficult for several reasons. The typically compact genomes often contain a number of overlapping coding and non-coding functional elements, which can result in unusual patterns of codon usage; conservation between related sequences can be difficult to interpret – especially within overlapping genes; and viruses often employ non-canonical translational mechanisms – e.g. frameshifting, stop codon read-through, leaky-scanning and internal ribosome entry sites – which can conceal potentially coding open reading frames (ORFs).
In a previous paper we introduced a new statistic – MLOGD (Maximum Likelihood Overlapping Gene Detector) – for detecting and analysing overlapping CDSs. Here we present (a) an improved MLOGD statistic, (b) a greatly extended suite of software using MLOGD, (c) a database of results for 640 virus sequence alignments, and (d) a web-interface to the software and database. Tests show that, from an alignment with just 20 mutations, MLOGD can discriminate non-overlapping CDSs from non-coding ORFs with a typical accuracy of up to 98%, and can detect CDSs overlapping known CDSs with a typical accuracy of 90%. In addition, the software produces a variety of statistics and graphics, useful for analysing an input multiple sequence alignment.
MLOGD is an easy-to-use tool for virus genome annotation, detecting new CDSs – in particular overlapping or short CDSs – and for analysing overlapping CDSs following frameshift sites. The software, web-server, database and supplementary material are available at http://guinevere.otago.ac.nz/mlogd.html.
Methods for finding protein-coding sequences (CDSs) in prokaryotes and eukaryotes are well-developed. Algorithms generally make use of combinations of the following signatures of CDSs: (a) codon or dicodon bias etc., (b) conservation between species, (c) similarity to known sequences, (d) presence of open reading frames (ORFs), splice sites etc., and (e) expression in cDNA/EST libraries [1].
In virus genomes, however, the situation can be complicated by a number of factors that may lead to decreased sensitivity: (a) virus genomes are often too small (e.g. < 10 kb) to obtain codon usage statistics and, in any case, the compact genomes often contain overlapping coding and non-coding functional elements that can result in unusual codon usage patterns; (b) regions of high conservation between related sequences may not necessarily be coding and, where CDSs and/or non-coding functional elements overlap, conservation may only reveal the presence of one of the overlapping pair; (c) new virus types often contain novel CDSs, dissimilar to previously annotated CDSs; and (d) viruses can employ a variety of non-canonical translational mechanisms – e.g. frameshifting, stop codon read-through, leaky-scanning and internal ribosome entry sites.
Comparative genomics is a particularly useful way to detect new CDSs in virus genomes, because many sequenced virus genomes, covering a useful range of diversity (i.e. sequence divergence), are available. In its simplest form, a comparative genomics approach consists of looking for genome regions that are more conserved than average between related sequences. Such an approach may fail to distinguish CDSs from other conserved elements. A more advanced approach is to look for the particular mutation patterns associated with CDSs – e.g. the CRITICA software [2], or pair hidden Markov models [3]. However, previous such algorithms have not dealt adequately with the case of overlapping CDSs.
In a previous paper [4] we introduced a probabilistic model for the mutation patterns associated with non-coding, single-coding and double-coding regions of a multiple sequence alignment, and a maximum likelihood statistic – called MLOGD – for predicting whether a new query ORF is coding or non-coding. Here we present (a) an improved MLOGD statistic, (b) a greatly extended suite of software using MLOGD (70% is new relative to [4]; the rest has been substantially revised), (c) a database of results in virus genomes, and (d) a web-interface to the software and database.
The MLOGD statistic
Given an input sequence alignment, a null model of the CDS annotation (i.e. the known CDSs in some chosen reference sequence) and an alternate model (i.e. the known CDSs plus a new putative CDS), the MLOGD statistic is an estimate of the relative probabilities of obtaining the observed pattern of mutations across the alignment under each of the null and alternate models. In this subsection we first describe how the MLOGD statistic is calculated for a pairwise sequence alignment. Then we describe how this is extended to a multiple sequence alignment. More extensive notes are given on the website.
MLOGD statistic for a two-sequence alignment
Given two aligned sequences S1 and S2, we estimate the probability that S1 mutates to S2, after time t, by
$$P(S_1 \rightarrow S_2;\, t, M) = \prod_k P(n^1_k \rightarrow n^2_k;\, t, m_k),$$
where $n^1_k$ and $n^2_k$ are the nucleotides in S1 and S2 at the k th alignment position (see website for treatment of alignment gaps); $P(n^1_k \rightarrow n^2_k;\, t, m_k)$ is calculated using nucleotide, codon and amino acid substitution matrices, as described in [4] (with the obvious extension for the non-coding model); M is the null or alternate model; and $m_k$ is the coding status (non-coding, single-coding or double-coding) at the k th alignment position, according to the relevant model M (defined on the chosen reference sequence).
We define the pairwise sequence divergence, Λ, to be the total number of point nucleotide differences between S1 and S2, and we determine t numerically for each of the null and alternate models by requiring the expected number of mutations between S1 and S2, under the model, to equal the observed number of mutations, Λ.
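The log likelihood ratio of the two models is
$$\log(\mathrm{LR}) = \log \frac{P(S_1 \rightarrow S_2;\, t_{\mathrm{alt}}, M_{\mathrm{alt}})}{P(S_1 \rightarrow S_2;\, t_{\mathrm{null}}, M_{\mathrm{null}})},$$
where the subscripts denote the alternate and null models, each with its own fitted value of t.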
If log(LR) is positive, then the observed mutations between S1 and S2 are more consistent with the alternate model. If log(LR) is negative, then the observed mutations are more consistent with the null model.
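As a rough illustration of the pairwise procedure, the following Python sketch fits t for each model by requiring the expected number of differences to equal the observed number, and then forms log(LR). It is not the MLOGD implementation: a Jukes-Cantor substitution model with hypothetical per-codon-position rate multipliers (CODING_RATES) stands in for the nucleotide, codon and amino acid substitution matrices that MLOGD uses, gaps are ignored, and all function names are invented for this example.

```python
import math

# Assumption: relative substitution rates at codon positions 1, 2, 3.
# These numbers are illustrative only and are not taken from MLOGD.
CODING_RATES = (0.6, 0.4, 2.0)

def p_site(a, b, t, rate):
    """Jukes-Cantor probability that nucleotide a becomes b after time t."""
    e = math.exp(-4.0 * rate * t / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e

def expected_diffs(t, rates):
    """Expected number of differing sites after time t."""
    return sum(0.75 * (1.0 - math.exp(-4.0 * r * t / 3.0)) for r in rates)

def fit_t(n_diffs, rates, tol=1e-10):
    """Bisection for the t at which expected_diffs(t) equals n_diffs."""
    if n_diffs >= 0.75 * len(rates):
        raise ValueError("divergence at or beyond saturation")
    lo, hi = 0.0, 1.0
    while expected_diffs(hi, rates) < n_diffs:
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if expected_diffs(mid, rates) < n_diffs:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def log_likelihood(s1, s2, rates):
    """log P(S1 -> S2; t, M) with t fitted to the observed divergence."""
    n_diffs = sum(a != b for a, b in zip(s1, s2))
    t = fit_t(n_diffs, rates)
    return sum(math.log(p_site(a, b, t, r)) for a, b, r in zip(s1, s2, rates))

def log_lr(s1, s2):
    """log(LR) for 'whole alignment is one CDS in frame 0' versus 'non-coding'."""
    null_rates = [1.0] * len(s1)
    alt_rates = [CODING_RATES[k % 3] for k in range(len(s1))]
    return log_likelihood(s1, s2, alt_rates) - log_likelihood(s1, s2, null_rates)
```

In this toy model, differences concentrated at third codon positions give a positive log(LR), and the same number of differences at second codon positions gives a negative one, which mirrors the interpretation of the sign given above.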
MLOGD statistic for a multiple sequence alignment
The MLOGD statistic, ∑tree log(LR), is calculated for a multiple sequence alignment using the following procedure. First a phylogenetic tree is constructed using standard software (e.g. PHYLIP [5]). The (unrooted) tree is used to select a list of sequence pairs tracing round the outside of the tree (figure on website). Such a set of pairwise comparisons covers each branch of the tree precisely twice. The pairwise statistic, log(LR), is calculated for each of the sequence pairs in the list, summed over all the pairs, and divided by two, to give the MLOGD statistic for the multiple sequence alignment, ∑tree log(LR). Similarly, the total number of mutations across the phylogenetic tree, ∑tree Λ, is the sum of the Λ values for each sequence pair, divided by two.
The required input data for MLOGD are a multiple sequence alignment of related sequences, a list of known CDSs (possibly none) in a chosen reference sequence, and a phylogenetic tree. Circular genomes are fully supported. For viruses, useful sets of related sequences may be obtained from the NCBI Viral Genomes Project website [6]. There are tools on the web-server to help produce a suitable alignment and phylogenetic tree.
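The pair-selection step can be sketched in Python as follows. This is an illustration under stated assumptions rather than the MLOGD code: the tree is a hypothetical five-leaf example stored as an adjacency list, a depth-first leaf ordering is used as one way of "tracing round the outside" of the tree, and the names outer_pairs, path_edges and tree_sum are invented here. The sketch assumes a tree with at least three leaves.

```python
from collections import Counter

# Hypothetical unrooted tree: leaves A-E, internal nodes X, Y, Z.
TREE = {
    "X": ["A", "B", "Y"],
    "Y": ["X", "C", "Z"],
    "Z": ["Y", "D", "E"],
    "A": ["X"], "B": ["X"], "C": ["Y"], "D": ["Z"], "E": ["Z"],
}

def leaf_order(adj, node, parent=None):
    """Leaves in the order met on a depth-first walk starting at an internal node."""
    if len(adj[node]) == 1 and parent is not None:
        return [node]
    leaves = []
    for nxt in adj[node]:
        if nxt != parent:
            leaves += leaf_order(adj, nxt, node)
    return leaves

def outer_pairs(adj):
    """Neighbouring leaves in cyclic order, i.e. pairs tracing round the tree."""
    root = next(n for n in adj if len(adj[n]) > 1)
    leaves = leaf_order(adj, root)
    return [(leaves[i], leaves[(i + 1) % len(leaves)]) for i in range(len(leaves))]

def path_edges(adj, a, b, parent=None):
    """Branches on the unique path from a to b (None if b is not beyond a)."""
    if a == b:
        return []
    for nxt in adj[a]:
        if nxt != parent:
            rest = path_edges(adj, nxt, b, a)
            if rest is not None:
                return [frozenset((a, nxt))] + rest
    return None

def tree_sum(pairs, pair_stat):
    """Sum a pairwise statistic over the pair list and halve it."""
    return 0.5 * sum(pair_stat(a, b) for a, b in pairs)

pairs = outer_pairs(TREE)
coverage = Counter(e for a, b in pairs for e in path_edges(TREE, a, b))
print(pairs)                   # five pairs: A-B, B-C, C-D, D-E, E-A
print(set(coverage.values()))  # {2}: every branch is crossed exactly twice
```

Because each branch is crossed twice, halving the sum of any per-pair quantity that is additive along branches recovers its total over the tree. For instance, using the number of branches on each path as the pairwise statistic, tree_sum returns 7.0, the number of branches in this example tree.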
The MLOGD software has three operation modes, described below. The 'Test input query CDSs' option can be used to test a specific query CDS (e.g. an ORF that has not previously been annotated as a CDS). The 'Find and test all non-annotated ORFs' and 'Six-frame sliding window plots' options can be used to search a whole input alignment for new CDSs.
Test input query CDSs
Here MLOGD calculates the null versus alternate model likelihood ratio statistics, where the null model is that the query ORF is non-coding and the alternate model is that the query ORF is coding (both models include all the annotated CDSs). The output comprises:
- A table of log(LR) statistics for each reference – non-reference sequence pair.
- A plot of the log(LR) statistic for each reference – non-reference sequence pair and summed over the phylogenetic tree. On the web-server there is a link to generate Monte Carlo simulated sequences under the same null and alternate models. The simulations are used to estimate confidence limits on the log(LR) statistics. (Figures not shown.)
- A nucleotide-by-nucleotide plot of the log(LR) statistic for each reference – non-reference sequence pair, the sum over the phylogenetic tree, and running means of the same (Figure 1). On the web-server there are links to zoom in on the plot and/or adjust the running-mean window size.
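For readers who want to reproduce the smoothing used in such plots, a running mean over per-nucleotide scores can be sketched as below. The centred window and its truncation at the sequence ends are assumptions made for this example; the source does not state how MLOGD treats the ends, and the function name is hypothetical.

```python
def running_mean(values, window):
    """Centred running mean; the window is truncated at the sequence ends."""
    half = window // 2
    prefix = [0.0]
    for v in values:
        prefix.append(prefix[-1] + v)  # prefix[i] = sum of values[:i]
    smoothed = []
    for k in range(len(values)):
        lo = max(0, k - half)
        hi = min(len(values), k + half + 1)
        smoothed.append((prefix[hi] - prefix[lo]) / (hi - lo))
    return smoothed
```

A larger window suppresses site-to-site noise in the per-nucleotide log(LR) values at the cost of blurring the boundaries of short coding regions.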
Find and test all non-annotated ORFs
The 'Find and test all non-annotated ORFs' option finds all non-annotated ORFs longer than a specified minimum length, and produces the same statistics and plots as the 'Test input query CDSs' option for each of these ORFs. The user may select 'start-stop' ORFs or 'stop-stop' ORFs.
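The distinction between 'start-stop' and 'stop-stop' ORFs can be made concrete with a small Python sketch. It is a simplified stand-in, not the MLOGD ORF finder: it scans only the three forward frames of a linear sequence, treats ATG as the only start codon, uses the standard stop codons, ignores ORFs that run off the end of the sequence, and does not filter out annotated CDSs. The function name and coordinate convention are chosen for this example.

```python
STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=40, mode="start-stop"):
    """Forward-strand ORFs as (start, end, frame), 0-based, stop codon excluded.

    mode="start-stop": from the first ATG after a stop to the next stop.
    mode="stop-stop":  from just after a stop (or the frame start) to the next stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        open_at = frame if mode == "stop-stop" else None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon in STOPS:
                if open_at is not None and (i - open_at) // 3 >= min_codons:
                    orfs.append((open_at, i, frame))
                open_at = i + 3 if mode == "stop-stop" else None
            elif mode == "start-stop" and open_at is None and codon == "ATG":
                open_at = i
    return orfs

demo = "CC" + "GCT" * 3 + "ATG" + "GCT" * 3 + "TAA"
print(find_orfs(demo, min_codons=4, mode="start-stop"))
print(find_orfs(demo, min_codons=4, mode="stop-stop"))
```

For the demo sequence, the stop-stop ORF extends upstream of the ATG to the start of the frame, whereas the start-stop ORF begins at the ATG. The reverse strand could be searched by passing the reverse complement of the sequence.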
Six-frame sliding window plots
Results and discussion
Sensitivity and selectivity
The software has previously been tested on simulated data and on overlapping CDSs in the Hepatitis B Virus and Escherichia coli genomes [4]. In a further test on 14 virus alignments, all 37 known CDSs were detected (including five examples of overlapping CDSs completely contained within other CDSs, and 20 CDSs that partially overlap other CDSs). Conversely, the false positive rate for all non-coding ORFs of at least 40 codons was 0.06 (-2 frame overlaps excluded; see website for details). In addition, all the false positives had very low MLOGD scores (outside the range observed for the known CDSs).
Further tests showed that, for alignments with just 20 mutations overall (i.e. ∑tree Λ = 20; e.g. a pairwise comparison of two 100 nt sequences with a mean divergence of 0.2 mutations per nt), MLOGD can discriminate non-overlapping CDSs from non-coding ORFs with a typical accuracy of up to 98%, and can detect CDSs overlapping known CDSs with a typical accuracy of 90% (see website for details). In general usage, ∑tree Λ is often much greater than 20, with correspondingly lower predicted error rates.
On-line virus database
A database of results for 640 virus sequence alignments is available on the website. The database contains multiple sequence alignments, phylogenetic trees, positions of known CDSs, six-frame sliding window plots, statistics and plots for the annotated CDSs, and statistics and plots for all non-annotated start-stop ORFs of at least 40 codons in the reference sequences.
We have presented (a) a new tool for locating and analysing CDSs in virus alignments, and (b) an on-line database of results for 640 virus alignments. Besides the easy-to-use website and comprehensive output, the main advantage of MLOGD over other gene-finding software is that MLOGD explicitly takes into account the possibility of overlapping genes, which are common in viruses. For example, for the Hepatitis B, Avian Hepatitis B, Polerovirus, Luteovirus and Human Immunodeficiency Virus 1 genomes (compact genomes with relatively high fractions of overlapping CDSs), MLOGD successfully finds all 28 known CDSs, while GeneMark only finds 17 (VIOLIN database [7]). We have extensively tested the sensitivity of MLOGD and shown it to be more sensitive than other methods for detecting overlapping CDSs [4].
MLOGD can, of course, also be used for cellular organisms. Partially overlapping CDSs are fairly common in prokaryotes, but it is not clear what fraction of overlaps are functionally constrained. Many appear to be the result of the loss of a stop codon, allowing one CDS to run into an adjacent CDS [8]. Others may be involved in regulatory mechanisms [9]. Similarly, many potential ribosomal frameshift sites – leading to overlapping CDSs – have been identified in cellular organisms [10], as well as viral genomes. MLOGD is a valuable tool for analysing the magnitude of functional constraints on such overlaps, with implications for the annotation of putative frameshift sites, and the evolution of overlapping genes in viruses and in prokaryotes.
Availability and requirements
The MLOGD software and virus database are available at http://guinevere.otago.ac.nz/mlogd.html (see also Additional file 1). Sequences may be entered into the web-interface, or the software (C++ programmes, C-shell scripts; distributed under the GNU General Public License) may be downloaded and used locally. To install the software locally, the publicly available packages EMBOSS [11] and R [12] must also be installed. The programme codaln [13] is recommended for aligning the input sequences. Run-time and resource-use scale approximately linearly with the number of sequences and the length of the input alignment. On a Pentium 4 2.8 GHz processor, analysing a 900 nt ORF takes ~3 s for a five-sequence alignment, while running six-frame sliding window plots (default window sizes) for a 10000 nt region takes ~300 s.
AEF gratefully acknowledges funding from the New Zealand Foundation for Research, Science and Technology, grant number UOOX0304. CMB gratefully acknowledges funding from the NZ Health Research Council.
1. Stormo GD: Gene-finding approaches for eukaryotes. Genome Res 2000, 10: 394–397. 10.1101/gr.10.4.394
2. Badger JH, Olsen GJ: CRITICA: Coding Region Identification Tool Invoking Comparative Analysis. Mol Biol Evol 1999, 16: 512–524.
3. Majoros WH, Pertea M, Salzberg SL: Efficient implementation of a generalized pair hidden Markov model for comparative gene finding. Bioinformatics 2005, 21: 1782–1788. 10.1093/bioinformatics/bti297
4. Firth AE, Brown CM: Detecting overlapping coding sequences with pairwise alignments. Bioinformatics 2005, 21: 282–292. 10.1093/bioinformatics/bti007
5. Felsenstein J: PHYLIP (Phylogeny Inference Package) version 3.6. 2004. [http://evolution.genetics.washington.edu/phylip.html]
6. Bao Y, Federhen S, Leipe D, Pham V, Resenchuk S, Rozanov M, Tatusov R, Tatusova T: National center for biotechnology information viral genomes project. J Virol 2004, 78: 7291–7298. [http://www.ncbi.nlm.nih.gov/genomes/VIRUSES/viruses.html] 10.1128/JVI.78.14.7291-7298.2004
7. Mills R, Rozanov M, Lomsadze A, Tatusova T, Borodovsky M: Improving gene annotation of complete viral genomes. Nucleic Acids Res 2003, 31: 7041–7055. [http://opal.biology.gatech.edu/GeneMark/VIOLIN/] 10.1093/nar/gkg878
8. Fukuda Y, Nakayama Y, Tomita M: On dynamics of overlapping genes in bacterial genomes. Gene 2003, 323: 181–187. 10.1016/j.gene.2003.09.021
9. Johnson ZI, Chisholm SW: Properties of overlapping genes are conserved across microbial genomes. Genome Res 2004, 14: 2268–2272. 10.1101/gr.2433104
10. Hammell AB, Taylor RC, Peltz SW, Dinman JD: Identification of putative programmed -1 ribosomal frameshift signals in large DNA databases. Genome Res 1999, 9: 417–427.
11. Rice P, Longden I, Bleasby A: EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet 2000, 16: 276–277. [http://emboss.sourceforge.net/] 10.1016/S0168-9525(00)02024-2
12. The R Project for Statistical Computing [http://www.r-project.org/]
13. Stocsits RR, Hofacker IL, Fried C, Stadler PF: Multiple Sequence Alignments of Partially Coding Nucleic Acid Sequences. BMC Bioinformatics 2005, 6: 160. 10.1186/1471-2105-6-160
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.