- Software
- Open access
CNCA aligns small annotated genomes
BMC Bioinformatics volume 25, Article number: 89 (2024)
Abstract
Background
To explore the evolutionary history of sequences, a sequence alignment is a first and necessary step, and its quality is crucial. In the context of the study of the proximal origins of SARS-CoV-2 coronavirus, we wanted to construct an alignment of genomes closely related to SARS-CoV-2 using both coding and non-coding sequences. To our knowledge, there is no tool that can be used to construct this type of alignment, which motivated the creation of CNCA.
Results
CNCA is a web tool that aligns annotated genomes from GenBank files. It generates a nucleotide alignment that is then updated based on the protein sequence alignment. The final nucleotide alignment matches the protein alignment and is guaranteed to contain no frameshift. CNCA was designed to align closely related small genome sequences of up to 50 kb (typically viruses) for which the gene order is conserved.
Conclusions
CNCA constructs multiple alignments of small genomes by integrating both coding and non-coding sequences. This preserves regions traditionally ignored in conventional back-translation methods, such as non-coding regions.
Background
A naive nucleotide alignment of annotated genomes usually results in many frameshifts and other oddities that do not exist in protein alignments. Several methods have been developed to perform nucleotide alignments taking protein alignment into account. One approach is “back-translation”, where coding nucleotide sequences are translated into amino acid sequences that are then aligned. Corresponding codons are then aligned in a final nucleotide alignment. The web-based tool webPRANK (https://www.ebi.ac.uk/goldman-srv/webprank/; [1]) is such an example. Other tools based on back-translation offer specific options like the choice of genetic codes (PAL2NAL [2], transAlign [3], RevTrans [4]). Some are designed to consider cases in which frameshifts or stop codons can occur (MACSE [5, 6], PAL2NAL [2], transAlign [3]). TranslatorX [7] checks the relevance of the amino acid alignment by finding regions of uncertainty in the amino acid alignment (masked by Gblocks [8]) and reports them in the nucleotide alignment. Others are optimized for virus gene sequences (NucAmino [9], VIRULIGN [10]). To the best of our knowledge, none of these methods produces a genome alignment that includes both coding and non-coding regions. We have thus developed CNCA (Coding / Non-Coding Aligner), a genome-wide solution that returns a full genome alignment compatible with the protein sequence alignment. The method was designed for small (up to 50 kb) homologous annotated syntenic genomes devoid of introns, such as virus genomes. It will ease the subsequent evolutionary analysis of annotated genomes.
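The back-translation idea described above can be illustrated with a minimal Python sketch. It is not taken from any of the cited tools: the function names are hypothetical, and it assumes that each coding sequence is given without its stop codon and that its length is exactly three times the number of residues in the corresponding aligned protein.

```python
def back_translate(aligned_protein: str, coding_nt: str, gap: str = "-") -> str:
    """Thread the codons of an unaligned coding sequence onto a gapped protein.

    Assumption (hypothetical helper): `coding_nt` has no stop codon and
    contains exactly one codon per residue of `aligned_protein`.
    """
    n_residues = sum(1 for aa in aligned_protein if aa != gap)
    if len(coding_nt) != 3 * n_residues:
        raise ValueError("nucleotide length must be 3 x the number of residues")
    codons = (coding_nt[i:i + 3] for i in range(0, len(coding_nt), 3))
    # Each residue consumes one codon; each protein gap becomes three nucleotide gaps.
    return "".join(gap * 3 if aa == gap else next(codons) for aa in aligned_protein)


def back_translate_alignment(protein_msa: dict, coding_seqs: dict) -> dict:
    """Apply back_translate to every sequence of a protein alignment."""
    return {
        name: back_translate(aligned, coding_seqs[name])
        for name, aligned in protein_msa.items()
    }


# Toy example: sequence "a" lacks the third residue of sequence "b".
protein_msa = {"a": "MK-L", "b": "MKTL"}
coding_seqs = {"a": "ATGAAACTG", "b": "ATGAAAACCCTG"}
codon_msa = back_translate_alignment(protein_msa, coding_seqs)
print(codon_msa["a"])
print(codon_msa["b"])
```

Because gaps are only ever inserted as whole codons, the resulting nucleotide alignment cannot contain a frameshift, but anything outside the coding sequences is simply absent from it.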
Implementation
CNCA is a pipeline developed in Python and R. For the alignment steps, it uses MAFFT [11]. This pipeline can be run online at https://cnca.ijm.fr/.
In addition, a standalone version is available at https://github.com/jnlorenzi/CNCA_standalone.
CNCA takes as input two or more GenBank files of annotated genomes. To cap computation time on the server, sequences submitted via the online tool must be shorter than 50 kb. CNCA first aligns the nucleotide (nt) sequences of all genomes with MAFFT [11] and produces a Multiple Sequence Alignment (MSAnt). It then generates MSAaa, the MAFFT alignment of the concatenations of all protein sequences. As the concatenated sequence takes the protein sequences in the order of the gene annotations, synteny must be conserved. Note that an alternative pipeline would have been to align each coding region individually between genomes, but this approach was not chosen for the sake of speed and simplicity. The MSAnt is then updated using MSAaa for all coding regions where the two alignments are not concordant. A final MSAcnca is returned that contains no contradiction with MSAaa and thus no frameshift (Fig. 1A). We chose to implement a graphical web version of the pipeline to widen the potential user base to include non-experts. Results (logs and the three alignments MSAcnca, MSAnt, MSAaa in both nexus and fasta formats) are stored on the server for a week. An email with a link to access the results is sent to the user at the end of the procedure.
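The concatenation step, and the reason it requires conserved synteny, can be sketched as follows. This is an illustration under stated assumptions, not the CNCA source code: the `Cds` record, the function names and the toy gene names are hypothetical, and gene names are assumed to be consistent across genomes, which real GenBank annotations do not guarantee.

```python
from typing import NamedTuple


class Cds(NamedTuple):
    """Hypothetical record for one annotated coding sequence."""
    gene: str      # gene name taken from the annotation
    start: int     # start coordinate on the genome
    protein: str   # translated sequence, without the stop codon


def concatenate_proteins(cds_list):
    """Concatenate proteins in annotation order.

    Returns the concatenated sequence and, for each gene, its half-open
    (start, end) span within the concatenation.
    """
    ordered = sorted(cds_list, key=lambda c: c.start)
    pieces, spans, offset = [], [], 0
    for cds in ordered:
        pieces.append(cds.protein)
        spans.append((cds.gene, offset, offset + len(cds.protein)))
        offset += len(cds.protein)
    return "".join(pieces), spans


def same_gene_order(genomes):
    """True if all genomes list the same genes in the same order."""
    orders = {
        tuple(c.gene for c in sorted(cds_list, key=lambda c: c.start))
        for cds_list in genomes.values()
    }
    return len(orders) == 1


# Toy example with two genes per genome.
genome_a = [Cds("S", 300, "MFV"), Cds("ORF1ab", 10, "MESL")]
genome_b = [Cds("ORF1ab", 5, "MESL"), Cds("S", 280, "MFV")]
concatenated, spans = concatenate_proteins(genome_a)
print(concatenated, spans)
print(same_gene_order({"a": genome_a, "b": genome_b}))
```

If the gene order differed between genomes, homologous proteins would sit at different positions in the concatenations and a single alignment of the concatenated sequences would no longer pair them correctly.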
Results
As an illustration, we used CNCA on a dataset of 12 annotated genomes closely related to SARS-CoV-2. The whole pipeline runs in 45 min and generates an alignment compatible with current knowledge of coronavirus evolution. Figure 1B presents a fraction of the resulting alignment, from the end of the ORF1ab coding region to the start of the Spike coding region. The 1-bp indel present in the intergenic region between ORF1ab and Spike is detected by the CNCA approach, but not via a simple nucleotide alignment (Fig. 1C) or via a back-translation method (as it ignores non-coding regions).
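The distinction between indels in coding and non-coding regions can be made concrete with a small check, sketched here in Python. It is a hypothetical helper, not part of CNCA: it assumes that coding regions are supplied as half-open column ranges of the alignment, it only tests gap lengths (a necessary but not sufficient condition for the absence of frameshifts), and it ignores gap runs that straddle a region boundary.

```python
import re


def frameshift_gaps(aligned_seq, coding_ranges, gap="-"):
    """List gap runs inside coding ranges whose length is not a multiple of 3.

    `coding_ranges` holds half-open (start, end) alignment column ranges
    (an assumption of this sketch). Returns (start_column, length) tuples.
    """
    offending = []
    for match in re.finditer(re.escape(gap) + "+", aligned_seq):
        start, end = match.start(), match.end()
        in_coding = any(lo <= start and end <= hi for lo, hi in coding_ranges)
        if in_coding and (end - start) % 3 != 0:
            offending.append((start, end - start))
    return offending


# Columns 0-14 and 18-23 are coding; columns 15-17 are intergenic.
coding = [(0, 15), (18, 24)]

# A 3-nt gap in a coding region and a 1-nt gap in the intergenic region: acceptable.
print(frameshift_gaps("ATGAAA---CTGTAAC-GATGTTT", coding))

# A 1-nt gap inside a coding region: reported as a potential frameshift.
print(frameshift_gaps("ATGAA-CTGTAA", [(0, 12)]))
```

A 1-bp indel is therefore legitimate in an intergenic region, whereas the same gap inside a coding region would contradict the protein alignment.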
Conclusions
CNCA is a user-friendly and simple online tool. It can construct multiple alignments of small genomes by integrating both coding and non-coding sequences. We developed it for coronaviruses and it can also be used for other virus families and for short syntenic genetic loci in bacteria.
Availability and requirements
Project name: CNCA.
Project home page: https://cnca.ijm.fr/
Operating system(s): Platform independent.
Programming language: Python, R, PHP.
License: MIT.
Any restrictions to use by non-academics: none.
Availability of data and materials
Project homepage: https://cnca.ijm.fr/; Standalone version available at https://github.com/jnlorenzi/CNCA_standalone.
References
1. Löytynoja A, Goldman N. webPRANK: a phylogeny-aware multiple sequence aligner with interactive alignment browser. BMC Bioinformatics. 2010;11:579.
2. Suyama M, Torrents D, Bork P. PAL2NAL: robust conversion of protein sequence alignments into the corresponding codon alignments. Nucleic Acids Res. 2006;34:W609-12.
3. Bininda-Emonds OR. transAlign: using amino acids to facilitate the multiple alignment of protein-coding DNA sequences. BMC Bioinformatics. 2005;6:156.
4. Wernersson R, Pedersen AG. RevTrans: multiple alignment of coding DNA from aligned amino acid sequences. Nucleic Acids Res. 2003;31:3537–9.
5. Ranwez V, Douzery EJP, Cambon C, Chantret N, Delsuc F. MACSE v2: toolkit for the alignment of coding sequences accounting for frameshifts and stop codons. Mol Biol Evol. 2018;35:2582–4.
6. Ranwez V, Harispe S, Delsuc F, Douzery EJP. MACSE: multiple alignment of coding sequences accounting for frameshifts and stop codons. PLoS ONE. 2011;6:e22594.
7. Abascal F, Zardoya R, Telford MJ. TranslatorX: multiple alignment of nucleotide sequences guided by amino acid translations. Nucleic Acids Res. 2010;38:W7-13.
8. Castresana J. Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis. Mol Biol Evol. 2000;17:540–52.
9. Tzou PL, Huang X, Shafer RW. NucAmino: a nucleotide to amino acid alignment optimized for virus gene sequences. BMC Bioinformatics. 2017;18:138.
10. Libin PJK, Deforche K, Abecasis AB, Theys K. VIRULIGN: fast codon-correct alignment and annotation of viral genomes. Bioinformatics. 2019;35:1763–5.
11. Nakamura T, Yamada KD, Tomii K, Katoh K. Parallelization of MAFFT for large-scale multiple sequence alignments. Bioinformatics. 2018;34:2490–2.
Funding
This work was supported by the Labex “Who AM I?”, ANR-11-LABX-0071 and the Université Paris Cité, Idex ANR-18-IDEX-0001, funded by the French Government through its “Investments for the Future” program.
Author information
Contributions
JNL: conceptualization (equal), software (lead), writing – review & editing (equal). FG: conceptualization (equal), funding acquisition (equal), supervision (equal). VCO: conceptualization (equal), funding acquisition (equal), supervision (equal), writing – review & editing (equal). GA: conceptualization (equal), funding acquisition (equal), supervision (equal), writing – original draft (lead).
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Lorenzi, JN., Graner, F., Courtier-Orgogozo, V. et al. CNCA aligns small annotated genomes. BMC Bioinformatics 25, 89 (2024). https://doi.org/10.1186/s12859-024-05700-1
DOI: https://doi.org/10.1186/s12859-024-05700-1