Multiple sequence alignments of partially coding nucleic acid sequences
© Stocsits et al. 2005
Received: 03 February 2005
Accepted: 28 June 2005
Published: 28 June 2005
High quality sequence alignments of RNA and DNA sequences are an important prerequisite for the comparative analysis of genomic sequence data. Nucleic acid sequences, however, exhibit a much larger sequence heterogeneity compared to their encoded protein sequences due to the redundancy of the genetic code. It is desirable, therefore, to make use of the amino acid sequence when aligning coding nucleic acid sequences. In many cases, however, only a part of the sequence of interest is translated. On the other hand, overlapping reading frames may encode multiple alternative proteins, possibly with intermittent non-coding parts. Examples are, in particular, RNA virus genomes.
The standard scoring scheme for nucleic acid alignments can be extended to incorporate simultaneously information on translation products in one or more reading frames. Here we present a multiple alignment tool, codaln, that implements a combined nucleic acid plus amino acid scoring model for pairwise and progressive multiple alignments that allows arbitrary weighting for almost all scoring parameters. Resource requirements of codaln are comparable with those of standard tools such as ClustalW.
We demonstrate the applicability of codaln to various biologically relevant types of sequences (bacteriophage Levivirus and Vertebrate Hox clusters) and show that the combination of nucleic acid and amino acid sequence information leads to improved alignments. These, in turn, increase the performance of analysis tools that depend strictly on good input alignments such as methods for detecting conserved RNA secondary structure elements.
Multiple sequence alignments are a crucial prerequisite for a diverse set of methods ranging from the reconstruction of phylogenies and the quantification of adaptive evolution to the detection of conserved RNA secondary structures and protein motifs. In this contribution we present a novel alignment tool that has been developed primarily with the aim of improving multiple alignments of viral genomes. The genomes of RNA viruses often carry conserved RNA structures that perform vital functions during the life cycle of the virus. In many cases only a small part of the viral genome is functionally relevant at the level of RNA. Algorithms for the systematic search of conserved secondary structure patterns in large RNAs, such as QRNA, alidot [2–4], RNAz, and RNAdecoder, are based on the observation that a small number of point mutations is very likely to cause large changes in the secondary structure. Secondary structure elements that are consistently present in a group of sequences with less than, say, 95% average pairwise identity are therefore most likely the result of stabilizing selection, not a consequence of the high degree of sequence conservation.
A comprehensive analysis of the genomic secondary structure features using alidot is available for Picornaviridae , Flaviviridae , and Hepadnaviridae [10, 11]. A preliminary survey across a large subset of the available sequence data was presented very recently .
The comparative approach to detecting conserved RNA structures obviously depends on high-quality multiple alignments: even a relatively small number of alignment errors, which act like randomly placed mutations, will obscure the signals from consistent and compensatory point mutations and hence decrease the sensitivity of the RNA detection algorithms. While we eventually need an alignment of the genomic nucleic acid sequence, we observe that an overwhelming part of a viral genome codes for one or more proteins in one or more (overlapping) frames.
In this contribution we describe a progressive alignment tool that implements an extended scoring scheme. The scheme simultaneously incorporates information on translation products in one or more (partly) overlapping reading frames, so that the user can combine all available information from the nucleic acid sequences and, where present, the amino acid sequences. It makes explicit use of information about overlapping open reading frames, as they occur in many functional sequences, and allows arbitrary weighting for almost all scoring parameters. Scoring thereby becomes more reliable in those regions of the nucleic acid sequences where the underlying protein sequence contributes additional amino acid scores.
The codaln program implements Gotoh's algorithm for pairwise sequence alignments with affine gap cost functions. The only change compared to this standard recursive algorithm for nucleic acid sequence alignment concerns the (mis)match score $\sigma(x_i, y_j)$ of position $i$ from sequence $x$ with position $j$ from sequence $y$. Instead of taking into account only the nucleic acid letters, each position is considered as a vector containing the nucleic acid letter and the amino acid that would arise from translation in each of the three possible reading frames, provided the frame in question is actually translated. Thus, we have

$$
\begin{aligned}
\sigma(x_i, y_j) = {} & \beta_0\,\sigma_n(x_i, y_j) \\
 & + \beta_1\,\sigma_p\bigl(\pi[x_i x_{i+1} x_{i+2}],\ \pi[y_j y_{j+1} y_{j+2}]\bigr) \\
 & + \beta_2\,\sigma_p\bigl(\pi[x_{i-1} x_i x_{i+1}],\ \pi[y_{j-1} y_j y_{j+1}]\bigr) \\
 & + \beta_3\,\sigma_p\bigl(\pi[x_{i-2} x_{i-1} x_i],\ \pi[y_{j-2} y_{j-1} y_j]\bigr)
\end{aligned}
\tag{1}
$$
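The combined score of equation (1) can be illustrated with a short Python sketch. It is not codaln's C implementation. It reads σ_n as the nucleotide score, σ_p as an amino acid score, π as the translation of a triplet, and β_0 to β_3 as user-chosen weights. The nucleotide values (identity 1000, else 300) follow the default parameters listed in this article; the amino acid score used here is a deliberately simple identity score standing in for a real substitution matrix, and the names `sigma`, `starts_x`, and `starts_y` (sets of 0-based codon start positions that are actually translated) are hypothetical.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(seq, start):
    """Amino acid encoded by seq[start:start+3], or None if out of range or unknown."""
    if start < 0 or start + 3 > len(seq):
        return None
    return CODON_TABLE.get(seq[start:start + 3].upper().replace("U", "T"))


def sigma_n(a, b):
    """Nucleotide score: the article's defaults (identity 1000, else 300)."""
    return 1000 if a.upper() == b.upper() else 300


def sigma_p(a, b):
    """Assumption: identity score as a stand-in for an amino acid substitution matrix."""
    return 1000 if a == b else 0


def sigma(x, i, y, j, starts_x, starts_y, beta=(1.0, 1.0, 1.0, 1.0)):
    """Combined (mis)match score of x[i] against y[j] following equation (1).

    Term k (k = 1, 2, 3) uses the codon in which position i is the k-th letter,
    i.e. the codon starting at i - (k - 1). It contributes only if that codon
    is translated in both sequences.
    """
    score = beta[0] * sigma_n(x[i], y[j])
    for k in (1, 2, 3):
        sx, sy = i - (k - 1), j - (k - 1)
        if sx in starts_x and sy in starts_y:
            ax, ay = translate(x, sx), translate(y, sy)
            if ax is not None and ay is not None:
                score += beta[k] * sigma_p(ax, ay)
    return score
```

For example, with `x = "ATGGCC"`, `y = "ATGGCA"` and codons starting at positions 0 and 3 in both sequences, the third codon position (index 5) mismatches at the nucleotide level but both triplets encode alanine, so the amino acid term still rewards aligning the two positions. With empty `starts_x` and `starts_y` the function reduces to the plain nucleotide score, which is how untranslated regions are handled in this sketch.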
The score model is much simpler than the one proposed by Hein [22, 23] and implemented in combat and CAT. In contrast to these approaches, which enforce gap lengths that are multiples of three in order to maintain the reading frame, codaln does not use special gap penalties at all. Instead, it relies on the match scores from the coding regions to guide the alignment back into the correct reading frame after a frameshift insertion or deletion. This results in an algorithm that is both faster and able to handle overlapping reading frames.
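Because only the match score changes, the surrounding recursion is the ordinary affine-gap dynamic program. The following Python sketch of a Gotoh-style recursion computes the optimal score only (no traceback). It is an illustration under stated assumptions, not codaln's code: the function name `gotoh_score`, the convention that a gap of length k costs `gap_open + (k - 1) * gap_extend`, and the penalty values in the usage note are all hypothetical. The `match` callable receives the sequences and positions rather than two letters, so a context-dependent score such as equation (1) can be plugged in without touching the recursion.

```python
def gotoh_score(x, y, match, gap_open, gap_extend):
    """Optimal global alignment score with affine gap costs (maximisation).

    match(x, i, y, j) scores aligning x[i] with y[j] (0-based positions).
    A gap of length k costs gap_open + (k - 1) * gap_extend (assumed convention).
    """
    neg = float("-inf")
    n, m = len(x), len(y)
    # M: x[i-1] aligned with y[j-1]; X: x[i-1] against a gap; Y: y[j-1] against a gap.
    M = [[neg] * (m + 1) for _ in range(n + 1)]
    X = [[neg] * (m + 1) for _ in range(n + 1)]
    Y = [[neg] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0
    for i in range(1, n + 1):
        X[i][0] = -(gap_open + (i - 1) * gap_extend)
    for j in range(1, m + 1):
        Y[0][j] = -(gap_open + (j - 1) * gap_extend)

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match(x, i - 1, y, j - 1)
            M[i][j] = s + max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            # Opening a gap costs gap_open; continuing an existing one costs gap_extend.
            X[i][j] = max(M[i - 1][j] - gap_open,
                          X[i - 1][j] - gap_extend,
                          Y[i - 1][j] - gap_open)
            Y[i][j] = max(M[i][j - 1] - gap_open,
                          Y[i][j - 1] - gap_extend,
                          X[i][j - 1] - gap_open)
    return max(M[n][m], X[n][m], Y[n][m])
```

A call such as `gotoh_score("ACGT", "ACT", lambda x, i, y, j: 1000 if x[i] == y[j] else 300, 1500, 500)` aligns the two sequences with a single one-letter gap. Note that nothing in the recursion forces gap lengths to be multiples of three; in the approach described above, any such preference has to come from the amino acid terms of the match score.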
In its current implementation codaln can deal with 18 different codon tables, including the standard genetic code (default), various non-canonical tables for certain groups of organisms, and 11 distinct codon tables for mitochondrial genomes.
Codon tables: the program can use 18 codon tables to link nucleic acid triplets with their corresponding amino acids.
Organisms featuring this codon table:
universal genetic code (default)
Tetrahymena, Paramecium, Oxytricha, Stylonychia, Glaucoma
canonical mitochondrial code
Vertebrates – mitochondrial code
Arthropods – mitochondrial code
Echinoderms – mitochondrial code
Molluscs – mitochondrial code
Ascidians – mitochondrial code
Nematodes – mitochondrial code
Plathelminths – mitochondrial code
Yeasts – mitochondrial code
Euascomycetes – mitochondrial code
Protozoans – mitochondrial code
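How a choice of codon table changes the translation π can be sketched in Python as follows. This is a minimal illustration, not codaln's internal representation of its codon tables: the names `STANDARD_CODE`, `VERT_MITO_CODE`, and `translate` are hypothetical. The vertebrate mitochondrial code is built from the standard code by the four well-known reassignments (AGA and AGG become stop codons, ATA encodes Met, TGA encodes Trp).

```python
from itertools import product

BASES = "TCAG"
STANDARD_AA = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Standard genetic code, codons enumerated in TCAG order.
STANDARD_CODE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), STANDARD_AA)}

# Vertebrate mitochondrial code: same as the standard code except for four codons.
VERT_MITO_CODE = dict(STANDARD_CODE, AGA="*", AGG="*", ATA="M", TGA="W")


def translate(seq, table=STANDARD_CODE):
    """Translate seq in frame 0; unknown triplets (e.g. containing N) become 'X'."""
    seq = seq.upper().replace("U", "T")
    return "".join(table.get(seq[k:k + 3], "X") for k in range(0, len(seq) - 2, 3))
```

The same nucleotide string then yields different peptides depending on the table: `translate("ATGTGAATA")` gives `M*I` under the standard code but `MWM` under the vertebrate mitochondrial code, which is why the amino acid terms of the score depend on selecting the appropriate table.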
Default scoring parameters (all can be reweighted or overridden by user-defined settings).
nucleic acid scores: identity 1000, else 300
gap open penalty
gap extension penalty
The profile alignments respect the full model of both nucleic acid and amino acid (mis)match scores. In the present implementation, the sequences within a profile are unweighted; it would be straightforward, however, to include a weighting scheme that takes the relative distances of the sequences into account to reduce the weight of very similar sequences, as implemented e.g. in ClustalW.
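The difference between unweighted and weighted profiles can be made concrete with a small Python sketch of a column-against-column score. It rests on several assumptions that the text does not settle: the score of two profile columns is taken to be the average of the pairwise scores over all cross pairs, pairs involving a gap character contribute zero, and `pair_score` is a simple letter-based stand-in for the full combined nucleic acid plus amino acid score. The function name `profile_column_score` and its parameters are hypothetical.

```python
def profile_column_score(col_a, col_b, pair_score, weights_a=None, weights_b=None, gap="-"):
    """Weighted average of pair_score over all cross pairs of two alignment columns.

    col_a and col_b hold one character per sequence of each profile.
    With weights left as None every sequence counts equally (the unweighted case).
    Assumption: pairs that involve a gap contribute a score of zero.
    """
    wa = weights_a if weights_a is not None else [1.0] * len(col_a)
    wb = weights_b if weights_b is not None else [1.0] * len(col_b)
    total = 0.0
    norm = 0.0
    for a, u in zip(col_a, wa):
        for b, v in zip(col_b, wb):
            norm += u * v
            if a != gap and b != gap:
                total += u * v * pair_score(a, b)
    return total / norm if norm else 0.0
```

Passing sequence weights, for instance smaller weights for groups of near-identical sequences, changes only the averaging and leaves the pairwise score model untouched, which is why adding a ClustalW-style weighting scheme would be a local change.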
Not surprisingly, we observe that codaln multiple alignments of coding DNA sequences have a much larger fraction of gaps with a length divisible by three than ClustalW multiple alignments. This is the desired effect of including amino acid-based scoring contributions, since it reduces biologically implausible frameshifts. In itself, this is of course not direct evidence for real improvements of multiple nucleic acid sequence alignments, but it is a good indication that, in coding regions, codaln preferentially places insertions and deletions at the protein level.
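The statistic discussed here, the fraction of gaps whose length is divisible by three, is easy to compute from any alignment. The Python sketch below is one possible way to do so and is not the procedure used for the comparison in this article: it counts every maximal run of gap characters in every aligned row, including leading and trailing runs, and the function name `fraction_gaps_divisible_by_three` is hypothetical.

```python
import re


def fraction_gaps_divisible_by_three(aligned_rows, gap="-"):
    """Fraction of gap runs whose length is a multiple of three.

    aligned_rows: iterable of equal-length aligned sequences.
    Every maximal run of gap characters in every row counts as one gap.
    Returns 0.0 if the alignment contains no gaps.
    """
    run = re.compile(re.escape(gap) + "+")
    lengths = [len(m.group()) for row in aligned_rows for m in run.finditer(row)]
    if not lengths:
        return 0.0
    return sum(1 for n in lengths if n % 3 == 0) / len(lengths)
```

For the toy alignment `["ATG---GCC", "ATGA--GCC", "ATGAAAGCC"]` there is one gap of length three and one of length two, so the function returns 0.5. Comparing this value between two alignments of the same coding sequences gives a quick, if indirect, indication of how often each alignment breaks the reading frame.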
Unfortunately, good hand-curated multiple alignments of partially coding sequences do not seem to be available, so a systematic quantitative analysis (using, e.g., the BAliBASE tools) could not be performed. Pairwise alignments of coding DNA sequences are barely distinguishable from those obtained with combat, provided the amino acid contributions dominate codaln's scoring function. We therefore concentrate on a qualitative assessment of codaln alignments in particular application contexts.
Hox genes were first described in the fruit fly Drosophila melanogaster. They code for homeodomain-containing transcription factors and are characterized by a 60 amino acid helix-turn-helix DNA-binding homeodomain. This domain is highly conserved at the protein level, but can be quite variable at the DNA level.
Vertebrates, in contrast to all invertebrates examined, have multiple Hox gene clusters that have arisen from a single ancestral cluster in the most recent common ancestor of chordates [29, 30]. This ancient cluster in turn evolved through tandem gene duplications from a more ancient hypothetical protohox cluster .
Virus genomes serve as an ideal test case for a procedure that makes explicit usage of information about (overlapping) coding regions to improve the resulting alignments. Improved alignments, as we shall see, increase the sensitivity of the detection of secondary structure elements.
The members of the genus Levivirus infect eubacteria (Procarya). All members of the family Leviviridae (Levivirus and Allolevivirus) are positive-strand ssRNA viruses. The replication cycle includes no DNA stage. The virions are neither enveloped nor tailed; their nucleocapsids are isometric, 24–26 nm in diameter. The total genome length ranges from 3466 to 4276 nucleotides, depending on the strain. Most Levivirus species have four (partly) overlapping genes, although a few contain only three open reading frames [32, 33].
We have investigated 8 complete genomic sequences of the Levivirus genus: the Enterobacteria phages MS2, KU1, GA, and fr. Alignments of the genomic sequences were prepared using codaln and scanned for conserved RNA secondary structures using the alidot method. The results are compared to earlier studies based on ClustalW alignments [10, 12].
Long-range interactions, so-called panhandle structures, are known to play a role, e.g., in the replication of Bunyaviridae and Flaviviridae. It will be interesting to see whether the long-range interactions can be experimentally verified in Leviviridae as well.
The Qβ replicase amplifies RNA templates autocatalytically with high efficiency, and the recognition element, consisting of a hairpin and a short unpaired region at the 5'-terminus, is essential for recognition [36, 37].
Algorithms for the automatic detection of biologically functional secondary structure elements, such as the ones used here, depend on accurate alignments. Clearly, the quality of alignments can be enhanced by including additional biological information. In the case of codaln, we incorporate information on the coding properties of a nucleic acid sequence into the alignment process. We demonstrate this for alignments of the Hox4 genomic region, which consists of non-coding regions and two coding exons, one containing the highly conserved homeodomain, while the other exon is poorly conserved at the nucleic acid level. As expected, the quality of the alignment in the coding region can be increased significantly.
Virus genomes can serve as an ideal test case for a procedure that makes explicit use of information about various (overlapping) coding regions. Above we have seen that additional conserved secondary structure elements become detectable with the improved alignment. Leviviruses are, despite their short genome, a quite complex example. The sequences are at least in part highly divergent at the nucleic acid level, so that the information on the coding sequences in codaln significantly improves the quality of the alignment. Using codaln instead of a purely nucleotide-based alignment program, we found a putative recognition signal site, analogous to the one for the RNA replicase in Alloleviviruses.
The codaln program was specifically developed for applications to genomic sequences of RNA viruses with their partially overlapping reading frames and untranslated regions. The Hox gene example shows, however, that codaln is also applicable to other partially coding sequence data. The recent discovery of ORFs that overlap with different reading directions [38–40] suggests extending codaln to such cases as well. Our framework makes such an extension straightforward.
C source code and documentation may be downloaded from http://www.bioinf.uni-leipzig.de/Software/ or http://www.tbi.univie.ac.at/~roman/Codaln/.
The Hox4 sequences are taken from GenBank for Homo sapiens (HsA join(AC004080.2rc+AC010990 [201-6508]rc+AC004079 [75001-end]rc) [125253 126761], HsB NT_010783 [5306154 5309021]rc, HsC NT_009563 [586220 584941]rc, HsD NT_037537 [4224691 4225996]), Mus musculus (MmA NT_039343 [3862302 3864022]rc, MmB AC011194 [114551 116043], MmC NT_028016 [137212 139414], MmD AC015584 [164151 165456]), and Morone saxatilis (MsA AF089743 [29109 30386]). For Danio rerio the sequences are retrieved both from the web server of the Danio rerio Sequencing Project  and GenBank (DrAa AC107365rc [61628 62827], DrBa AL645782.2 [33537 35809], DrCa ctg23.10700001-10870000 [75679 77005]rc, DrD ctg13407.19000-191000 [61789 63580]rc).
rc = reverse complement; sequence intervals are listed in brackets.
This work is supported by the Austrian Fonds zur Förderung der Wissenschaftlichen Forschung, Project Nos. P-13545-MAT and P-15893, and the German DFG Bioinformatics Initiative.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.