PhyTB: Phylogenetic tree visualisation and sample positioning for M. tuberculosis
BMC Bioinformatics volume 16, Article number: 155 (2015)
Phylogenetic-based classification of M. tuberculosis and other bacterial genomes is a core analysis for studying evolutionary hypotheses, disease outbreaks and transmission events. Whole genome sequencing is providing new insights into the genomic variation underlying intra- and inter-strain diversity, thereby assisting with the classification and molecular barcoding of the bacteria. One roadblock to strain investigation is the lack of user-interactive solutions to interrogate and visualise variation within a phylogenetic tree setting.
We have developed a web-based tool called PhyTB (http://pathogenseq.lshtm.ac.uk/phytblive/index.php) to assist phylogenetic tree visualisation and the identification of M. tuberculosis clade-informative polymorphisms. Variant Call Format files can be uploaded to determine the position of a sample within the tree. A map view summarises the geographical distribution of alleles and strain-types. The utility of PhyTB is demonstrated on sequence data from 1,601 M. tuberculosis isolates.
PhyTB contextualises M. tuberculosis genomic variation within epidemiological, geographical and phylogenetic settings. The tool could be extended further by incorporating large variants and phenotypic data (e.g. drug-resistance profiles), and by assessing genotype-phenotype associations. Source code is available to develop similar websites for other organisms (http://sourceforge.net/projects/phylotrack).
Strain-specific genomic diversity in the Mycobacterium tuberculosis complex (MTBC) is an important factor in tuberculosis pathogenesis that may affect virulence, transmissibility, host response and emergence of drug resistance [1,2]. Some modern strains (e.g. Beijing, Euro-American, Haarlem) are believed to exhibit more virulent phenotypes than ancient ones (e.g. East African, Indian, M. africanum). M. tuberculosis is relatively clonal, with little recombination and a low mutation rate. As in other bacterial genomic settings, the construction of phylogenetic trees from sequence data facilitates taxonomic localisation and evolutionary analysis. The growing availability of M. tuberculosis whole genome sequences is leading to the full characterisation of single nucleotide polymorphisms (SNPs) and other nucleotide variation, such as insertions and deletions (indels). A SNP-based barcode has been developed to discriminate strain-types. Trees constructed using genome-wide variation have greater discriminatory power than traditional genotyping approaches such as MIRU-VNTR and spoligotyping. Clades reflecting strain-type variation may be used to investigate disease outbreaks or transmission events, where samples are identified through apparently identical genomic signatures [5,6]. The tree also provides a structure to identify variants that can be used to investigate clinically important traits such as drug resistance. The primary mechanism for acquiring resistance is the accumulation of point mutations in genes coding for drug targets or drug-converting enzymes (e.g. katG, inhA, rpoB, pncA, embB, rrs, gyrA, gyrB), and these mutations may exist in multiple lineages in the tree, reflecting homoplasy events. Some mutations thought to be related to drug resistance are in fact not, but are instead strain-informative.
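The barcode idea mentioned above can be illustrated with a minimal sketch. It assumes a barcode is simply a lookup from a marker SNP to a strain-type label. The positions, alleles and labels in `TOY_BARCODE`, and the function name `call_strain_types`, are invented for illustration and do not correspond to the published barcode or to PhyTB's code.

```python
from typing import Dict, List, Set, Tuple

Variant = Tuple[int, str]  # (genome position, alternative allele)

# Hypothetical barcode: positions, alleles and labels are invented for
# illustration and are NOT the markers of any published barcode.
TOY_BARCODE: Dict[Variant, str] = {
    (1000, "T"): "lineage2",
    (2000, "A"): "lineage2.2",
    (3000, "G"): "lineage4",
}


def call_strain_types(sample_snps: Set[Variant],
                      barcode: Dict[Variant, str]) -> List[str]:
    """Return every strain-type whose barcode marker is present in the sample."""
    return sorted(label for marker, label in barcode.items()
                  if marker in sample_snps)


# A toy sample carrying two barcode markers plus one unrelated SNP.
toy_sample = {(1000, "T"), (2000, "A"), (5555, "C")}
toy_calls = call_strain_types(toy_sample, TOY_BARCODE)
```

Because each marker is checked independently, a sample carrying markers from unrelated lineages would return several labels, which in practice might flag a mixed infection or a calling error.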
With the increased application of sequencing technologies within clinical and microbiological research settings, it is important that informatic tools are available to identify informative strain-type and drug resistance related variants. Web-browsers for the visualisation of M. tuberculosis genomic variation exist [8-10], but there is limited connectivity with phylogenetic trees and downstream analysis, especially involving strain-types and drug resistance. In addition, there is little provision for uploading new data, such as standard variant call files (VCFs) (www.htslib.org). Here we present the PhyTB tool, which facilitates the phylogenetic exploration of M. tuberculosis isolates, including the display of clade-specific informative and drug resistance markers and their genomic annotation. Using the browser, it is possible to upload multiple standard genomic variant call files (VCF format) to identify the closest relative within the M. tuberculosis complex global phylogeny, thereby potentially assisting their interpretation in a clinical or epidemiological context. Source code is available to facilitate the development of sites for other organisms with genomes that can be represented in a phylogeny.
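One simple way to think about positioning an uploaded sample is as a nearest-neighbour search over SNP profiles. The sketch below is an assumption about how this could be done, not a description of PhyTB's actual method: it reads the fixed VCF columns (POS, REF, ALT), keeps simple SNPs, and picks the panel sample with the fewest differing SNPs. The function names, the toy VCF and the toy panel are all hypothetical.

```python
from typing import Dict, Set, Tuple

Variant = Tuple[int, str]  # (position, alternative allele)


def parse_vcf_snps(vcf_text: str) -> Set[Variant]:
    """Collect single-nucleotide variants from the text of a single-sample VCF."""
    snps: Set[Variant] = set()
    for line in vcf_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and header/meta lines
        fields = line.split("\t")
        pos, ref, alt = int(fields[1]), fields[3], fields[4]
        # Keep only simple SNPs: one reference base, one alternative base.
        if len(ref) == 1 and len(alt) == 1 and alt in "ACGT":
            snps.add((pos, alt))
    return snps


def closest_relative(query: Set[Variant],
                     panel: Dict[str, Set[Variant]]) -> Tuple[str, int]:
    """Return the panel sample with the fewest SNP differences from the query."""
    # Symmetric difference counts SNPs present in one profile but not the other;
    # ties are broken alphabetically by sample name.
    best = min(panel, key=lambda name: (len(query ^ panel[name]), name))
    return best, len(query ^ panel[best])


# Toy data (invented): two SNPs and one deletion that the parser ignores.
TOY_VCF = "\n".join([
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "Chromosome\t100\t.\tA\tG\t50\tPASS\t.",
    "Chromosome\t250\t.\tC\tT\t50\tPASS\t.",
    "Chromosome\t400\t.\tGA\tG\t50\tPASS\t.",
])
TOY_PANEL = {
    "sample_A": {(100, "G"), (250, "T"), (900, "C")},
    "sample_B": {(100, "G"), (700, "A")},
}
toy_match = closest_relative(parse_vcf_snps(TOY_VCF), TOY_PANEL)
```

A raw SNP distance ignores missing calls and tree topology, so a production tool would likely place the sample on a branch using the SNPs assigned to internal nodes rather than comparing against leaves alone.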
Results and discussion
PhyTB uses 1,601 global MTBC whole-genome sequences from 11 studies, with representation across all 7 major lineages (lineage 1: 7.6%; lineage 2: 24.3%; lineage 3: 11.8%; lineage 4: 53.5%; lineages 5-7: 2.8%). The phylogenetic tree constructed using the 91k SNPs shows the expected clustering by lineage and strain-type (Figure 1). SNP information is displayed at internal nodes of the tree, thereby distinguishing unique strain-defining mutations from those arising in multiple branches (homoplastic mutations). The homoplastic mutations arise due to recombination or convergent evolution, potentially related to drug resistance. Figure 1 shows a deep phylogenetic SNP (R463L) in the katG gene that is present across all lineages except lineage 4. This SNP was historically, and mistakenly, thought to cause isoniazid resistance. PhyTB displays in tracks whether polymorphisms have previously been related to drug resistance or are strain-informative, and metadata (e.g. codon, amino acid) are shown by selecting the polymorphism of interest. It is possible to move from the tree view to a geographical map showing allele frequencies. A map view, accessed through the genome browser located below the tree, shows a SNP at position 762,434 in rpoB, a gene associated with rifampicin resistance. The alternative allele leads to a synonymous mutation (G876G) that is fixed in CAS (lineage 3) strains in Malawi (Figure 2) and all other study sites. To demonstrate the VCF positioning functionality, we used 100 M. tuberculosis samples [ENA:ERP000192] of known strain-type that were not included in the phylogeny. It was possible to position all of them unambiguously in the tree. Figure 3 shows the result of uploading the VCF file for a Russian sample [ENA:ERR019571], which has 5,067 SNPs, allowing it to be positioned correctly in a Beijing clade.
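The distinction between strain-defining and homoplastic SNPs can be sketched with a simple monophyly check: if the samples carrying a SNP are exactly the leaves under one node, a single mutation on that branch explains the pattern; otherwise the SNP is a candidate homoplasy. This is a simplified illustration under stated assumptions (a tree given as nested tuples, complete calls, no reversions), not PhyTB's implementation. The tree, leaf names and function names are hypothetical.

```python
from typing import FrozenSet, List, Set, Union

Tree = Union[str, tuple]  # a leaf name, or a tuple of subtrees


def clades(tree: Tree) -> List[FrozenSet[str]]:
    """Return the leaf set under every node of the tree (leaves included)."""
    found: List[FrozenSet[str]] = []

    def walk(node: Tree) -> FrozenSet[str]:
        if isinstance(node, str):
            leaves = frozenset([node])
        else:
            leaves = frozenset().union(*(walk(child) for child in node))
        found.append(leaves)
        return leaves

    walk(tree)
    return found


def classify_snp(carriers: Set[str], tree: Tree) -> str:
    """Label a SNP by whether its carriers form exactly one clade."""
    if frozenset(carriers) in clades(tree):
        return "clade-informative"
    return "candidate homoplasy"


# Toy tree (invented): two lineage 1 and two lineage 2 leaves group together,
# with two lineage 4 leaves on the other side of the root.
TOY_TREE = ((("L1_a", "L1_b"), ("L2_a", "L2_b")), ("L4_a", "L4_b"))

label_clade = classify_snp({"L2_a", "L2_b"}, TOY_TREE)
label_homoplasy = classify_snp({"L1_a", "L4_b"}, TOY_TREE)
```

A carrier set that fails the check is only a candidate: missing genotype calls or a reversion inside a clade would produce the same pattern, so ancestral state reconstruction would be needed to confirm independent origins.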
The PhyTB web browser contextualises TB genomic variation within epidemiological, geographical and phylogenetic settings. To assist with integrating such data for other organisms, we provide the source code, which has been packaged in the PhyloTrack library. In pathogenic bacteria like M. tuberculosis, data integration is crucial to distinguish drug-resistance mutations from phylogenetic markers, to study the transmission of outbreak strains, to detect the source of an infection, to inform patient management and to design appropriate infection control measures (e.g. rapid tests). The tool could be extended further to incorporate large variants and phenotypic data (e.g. drug-resistance profiles).
Availability and requirements
Reiling N, Homolka S, Walter K, Brandenburg J, Niwinski L, Ernst M, et al. Clade-specific virulence patterns of Mycobacterium tuberculosis complex strains in human primary macrophages and aerogenically infected mice. mBio. 2013; 4(4):00250–13.
Coll F, McNerney R, Guerra-Assuncao JA, Glynn JR, Perdigao J. A robust SNP barcode for typing Mycobacterium tuberculosis complex strains. Nat Commun. 2014; 5:4812.
Ford CB, Shah RR, Maeda MK, Gagneux S, Murray MB, Cohen T, et al. Mycobacterium tuberculosis mutation rate estimates from different lineages predict substantial differences in the emergence of drug-resistant tuberculosis. Nat Genet. 2013; 45:784–90.
Coll F, Mallard K, Preston M, Bentley S, Parkhill J. SpolPred: Rapid and accurate ascertainment of Mycobacterium tuberculosis strain types from short genomic sequences. Bioinformatics. 2012; 28:2991–3.
Clark TG, Mallard K, Coll F, Preston M. Transmission of multidrug-resistant tuberculosis in treatment experienced patients. PLoS One. 2013; 8(12):83012.
Guerra-Assunção JA, Houben RM, Crampin AC, Mzembe T, Mallard K, Coll F, Khan P, Banda L, Chiwaya A, Pereira RP, McNerney R, Harris D, Parkhill J, Clark TG, Glynn JR. Recurrence due to relapse or reinfection with Mycobacterium tuberculosis: a whole-genome sequencing approach in a large, population-based cohort with a high HIV infection prevalence and active follow-up. J Infect Dis. 2015; 211(7):1154–63.
Sandgren A, Strong M, Muthukrishnan P, Weiner BK, Church GM. Tuberculosis drug resistance mutation database. PLoS Med. 2009; 6(2):2.
Chernyaeva EN, Shulgina MV, Rotkevich MS, Dobrynin PV. Genome-wide Mycobacterium tuberculosis variation (GMTV) database: A new tool for integrating sequence variations and epidemiology. BMC Genomics. 2014; 15:308.
Coll F, Preston MD, Guerra-Assuncao JA, Glynn JR, Perdigao J. PolyTB: A genomic variation map for Mycobacterium tuberculosis. Tuberculosis. 2014; 94(3):346–54.
Wattam AR, Abraham D, Dalay O, Disz TL. PATRIC, the bacterial bioinformatics database and analysis resource. Nucl Acids Res. 2014; 42(D1):581–91.
Bostock M. D3.js - data driven documents. http://d3js.org/, (last modified June 21, 2014).
Skinner ME, Uzilov AV, Stein LD, Mungall CJ, Holmes IH. JBrowse: a next-generation genome browser. Genome Res. 2009; 19(9):1630–8.
Li H. Tabix: fast retrieval of sequence features from generic tab-delimited files. Bioinformatics (Oxford, England). 2011; 27(5):718–9.
Coll F, McNerney R, Preston MD, Guerra-Assuncao JA, Warry A, Hill-Cawthorne G, et al. Rapid determination of anti-tuberculosis drug resistance from whole-genome sequences. Genome Med. 2015.
This work has been supported by Bloomsbury Research Fund, Medical Research Council UK and Wellcome Trust.
The authors declare that they have no competing interests.
EDB developed the software under the supervision of FC, NF, FRM and TGC. FC and NF contributed scripts. FC, RM, JRG, SC and AP contributed data. The first draft of the manuscript was prepared by EDB, FC and TGC, with contributions from all authors to the final version. All authors read and approved the final manuscript.
Cite this article
Benavente, E.D., Coll, F., Furnham, N. et al. PhyTB: Phylogenetic tree visualisation and sample positioning for M. tuberculosis. BMC Bioinformatics 16, 155 (2015). https://doi.org/10.1186/s12859-015-0603-3