Normalizing alternate representations of large sequence variants across multiple bacterial genomes

Background and description

Variant-focused comparative genomics lets researchers study the evolution of distinct genetic characteristics in bacterial populations while avoiding the difficulties of whole-genome assembly and alignment. A major limitation of this approach is that many variant-detection tools predict only single nucleotide variants (SNVs) and small indels. Bacterial genomes, however, also harbor much larger sequence variants (LSVs), such as large indels and substitutions (>25 nt), relative to a reference genome. LSVs have been shown to shape important biological traits such as virulence and drug resistance, and to report on population structure [1-3]. Recent variant callers, such as Pilon, can identify LSVs with single-nucleotide accuracy in microbial genomes. One challenge remains: a single variant-detection tool can represent identical LSVs non-identically. This generally results from similarity in the sequence flanking the variant, combined with variation in read quality and alignment information in that region across strains. These alternate representations make it difficult to perform downstream analyses, such as association studies, that depend on consistent representations of variants.
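The effect of similar flanking sequence can be illustrated with a toy example. The sketch below is not taken from Pilon or Emu; the reference string, the 0-based `(position, ref allele, alt allele)` encoding, and the helper `apply_variant` are assumptions made for illustration, and the deletion is far shorter than a real LSV (>25 nt). It shows three different records that all describe the same deletion inside a tandem repeat.

```python
def apply_variant(reference: str, pos: int, ref_allele: str, alt_allele: str) -> str:
    """Replace ref_allele at 0-based position pos with alt_allele (hypothetical helper)."""
    if reference[pos:pos + len(ref_allele)] != ref_allele:
        raise ValueError("ref allele does not match the reference at this position")
    return reference[:pos] + alt_allele + reference[pos + len(ref_allele):]


# Toy reference: "GG", four "CA" repeat units, then "TT".
reference = "GGCACACACATT"

# Three records that a caller might emit in different samples.
call_a = (2, "CA", "")  # delete the first repeat unit
call_b = (8, "CA", "")  # delete the last repeat unit
call_c = (3, "AC", "")  # delete a shifted "AC" within the repeat

results = {apply_variant(reference, *call) for call in (call_a, call_b, call_c)}
print(results)  # a single sequence: the three records are equivalent
```

Because the repeat makes the deleted unit indistinguishable from its neighbors, all three records produce the same sequence even though their coordinates and alleles differ.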

We present Emu, an algorithm that resolves alternate representations of LSVs by comparing variant calls across genomes.
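The abstract does not describe how Emu works internally, so the following is only a minimal sketch of one way cross-sample canonicalization could be done, not Emu's algorithm. It assumes that two calls are equivalent when applying each one on its own to the reference yields the same sequence, and it picks the leftmost, shortest record as the canonical form. The function names, the call encoding, and the tie-breaking rule are all hypothetical.

```python
from collections import defaultdict


def apply_variant(reference, pos, ref_allele, alt_allele):
    """Replace ref_allele at 0-based position pos with alt_allele (hypothetical helper)."""
    if reference[pos:pos + len(ref_allele)] != ref_allele:
        raise ValueError("ref allele does not match the reference at this position")
    return reference[:pos] + alt_allele + reference[pos + len(ref_allele):]


def canonicalize(reference, calls_by_sample):
    """Map every (pos, ref, alt) call to one representative per equivalence class."""
    groups = defaultdict(set)
    for calls in calls_by_sample.values():
        for call in calls:
            # Calls that produce the same altered sequence land in the same group.
            groups[apply_variant(reference, *call)].add(call)

    canonical = {}
    for members in groups.values():
        # Assumed rule: leftmost position first, then shortest ref allele.
        representative = min(members, key=lambda c: (c[0], len(c[1]), c))
        for call in members:
            canonical[call] = representative
    return canonical


reference = "GGCACACACATT"
calls_by_sample = {
    "sample1": [(2, "CA", "")],
    "sample2": [(8, "CA", "")],
    "sample3": [(3, "AC", ""), (10, "T", "G")],
}

mapping = canonicalize(reference, calls_by_sample)
print(len(mapping), "representations ->", len(set(mapping.values())), "canonical forms")
```

Rebuilding the whole reference for every call is only practical for a toy string; on a real genome the comparison would have to be restricted to a local window around each call.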


To evaluate Emu's ability to resolve alternate representations of LSVs, we introduced 179 simulated LSVs into the H37Rv genome, a carefully curated and finished reference genome for Mycobacterium tuberculosis (Mtb). We then used Pilon to identify variants in a set of 161 clinical samples of Mtb collected in China [4], using the modified H37Rv genome as the reference. We identified a total of 10,001 unique variant representations. The average number of non-identical representations of each simulated LSV was 56 (range 1 to 145). We then applied Emu to identify the non-identical representations across the genomes of the 161 clinical samples and canonicalize them to a single form. Emu reduced the total number of non-identical representations to 676, bringing the average number of representations per LSV to 4, with 15 LSVs reduced to a single representation and no LSV having more than 25 representations.
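The reported averages can be checked with simple arithmetic, assuming they are the total number of representations divided by the 179 simulated LSVs (the abstract does not state the formula; the variable names are illustrative):

```python
simulated_lsvs = 179
representations_before = 10_001  # unique representations from the variant calls
representations_after = 676      # representations remaining after Emu

mean_before = representations_before / simulated_lsvs
mean_after = representations_after / simulated_lsvs

# Rounded to whole numbers, these match the averages quoted in the text.
print(round(mean_before), round(mean_after))
```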

We then investigated how Emu's ability to resolve alternate representations might affect association analyses, e.g., associating LSVs with population structure. We ran Pilon again on the set of 161 clinical samples from China, this time using the unmodified H37Rv genome as the reference. Pilon identified a total of 20,512 distinct LSVs. Applying Emu decreased the number of distinct LSVs by about 47%, to 10,936. Emu also increased the power of association tests on the LSVs: we initially identified 69 LSVs that were significantly associated (p < 0.01) with membership in a specific clade, and after processing with Emu that number increased to 94.
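Why consolidating representations can raise the power of an association test is easiest to see with made-up counts. The abstract does not name the statistical test that was used; the sketch below assumes a two-sided Fisher's exact test on a 2x2 table of clade membership against variant presence, and all sample counts are invented for illustration.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x: int) -> float:
        # Hypergeometric probability of x carriers in the first row.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= observed * (1 + 1e-9))


# Invented data: 20 samples in the clade, 20 outside it.
# One LSV carried by 12 in-clade samples but called under 3 representations,
# each seen in only 4 of those samples.
p_split = fisher_exact_two_sided(4, 16, 0, 20)    # a single representation
p_merged = fisher_exact_two_sided(12, 8, 0, 20)   # after merging all three

print(f"split: p = {p_split:.3g}, merged: p = {p_merged:.3g}")
```

With these counts, no single representation reaches p < 0.01, while the merged variant does, which is the kind of effect the increase from 69 to 94 significant LSVs suggests.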


Emu enables comprehensive analysis of LSVs in bacterial genomes by reducing the cross-sample noise that results from per-sample variant calls. By normalizing our variant calls with Emu, we increased our power to use LSVs in association tests. Pilon and Emu are open source tools that can also be applied to identify and normalize variants in other organisms.

References


  1. Alland D, Lacher DW, Hazbón MH, Motiwala AS, Qi W, Fleischmann RD, Whittam TS: Role of large sequence polymorphisms (LSPs) in generating genomic diversity among clinical isolates of Mycobacterium tuberculosis and the utility of LSPs in phylogenetic analysis. J Clin Microbiol. 2007, 45: 39-46. 10.1128/JCM.02483-05.

  2. Maurelli AT, Fernández RE, Bloch CA, Rode CK, Fasano A: "Black holes" and bacterial pathogenicity: a large genomic deletion that enhances the virulence of Shigella spp. and enteroinvasive Escherichia coli. Proc Natl Acad Sci USA. 1998, 95: 3943-3948. 10.1073/pnas.95.7.3943.

  3. Mutreja A, Kim DW, Thomson NR, Connor TR, Lee JH, Kariuki S, Croucher NJ, Choi SY, Harris SR, Lebens M, Niyogi SK, Kim EJ, Ramamurthy T, Chun J, Wood JLN, Clemens JD, Czerkinsky C, Nair GB, Holmgren J, Parkhill J, Dougan G: Evidence for several waves of global transmission in the seventh cholera pandemic. Nature. 2011, 477: 462-5. 10.1038/nature10392.

  4. Zhang H, Li D, Zhao L, Fleming J, Lin N, Wang T, Liu Z, Li C, Galwey N, Deng J, Zhou Y, Zhu Y, Gao Y, Wang T, Wang S, Huang Y, Wang M, Zhong Q, Zhou L, Chen T, Zhou J, Yang R, Zhu G, Hang H, Zhang J, Li F, Wan K, Wang J, Zhang X-E, Bi L: Genome sequencing of 161 Mycobacterium tuberculosis isolates from China identifies genes and intergenic regions associated with drug resistance. Nat Genet. 2013, 1-8. September

Rights and permissions

Open Access  This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.

The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

The Creative Commons Public Domain Dedication waiver applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Salazar, A., Earl, A., Desjardins, C. et al. Normalizing alternate representations of large sequence variants across multiple bacterial genomes. BMC Bioinformatics 16 (Suppl 2), A8 (2015).
