Models of deletion for visualizing bacterial variation: an application to tuberculosis spoligotypes

Background Molecular typing methods are commonly used to study genetic relationships among bacterial isolates. Many of these methods have become standardized and produce portable data. A popular approach for analyzing such data is to construct graphs, including phylogenies. Inferences from graph representations of data assist in understanding the patterns of transmission of bacterial pathogens, and basing these graph constructs on biological models of evolution of the molecular marker helps make these inferences. Spoligotyping is a widely used method for genotyping isolates of Mycobacterium tuberculosis that exploits polymorphism in the direct repeat region. Our goal was to examine a range of models describing the evolution of spoligotypes in order to develop a visualization method to represent likely relationships among M. tuberculosis isolates. Results We found that inferred mutations of spoligotypes frequently involve the loss of a single or very few adjacent spacers. Using a second-order variant of Akaike's Information Criterion, we selected the Zipf model as the basis for resolving ambiguities in the ancestry of spoligotypes. We developed a method to construct graphs of spoligotypes (which we call spoligoforests). To demonstrate this method, we applied it to a tuberculosis data set from Cuba and compared the method to some existing methods. Conclusion We propose a new approach in analyzing relationships of M. tuberculosis isolates using spoligotypes. The spoligoforest recovers a plausible history of transmission and mutation events based on the selected deletion model. The method may be suitable to study markers based on loci of similar structure from other bacteria. The groupings and relationships in the spoligoforest can be analyzed along with the clinical features of strains to provide an understanding of the evolution of spoligotypes.


Background
The visualization of relationships among genotypes of bacterial isolates is a useful approach to addressing both evolutionary and epidemiological questions. Inferences from graph representations of data assist in understanding the patterns of transmission of bacterial pathogens. Presently, there are two approaches to visualization. The first class of methods is sequence-based, and these methods often produce phylogenetic trees, or dendrograms. These trees are used extensively to represent relatedness of iso-lates that have been identified by almost any typing procedure. However, the models of sequence evolution upon which phylogenetic methods depend are not appropriate for many markers that are not sequence-based. Also, because many of these markers evolve rapidly enough to generate intra-specific variation, it is preferable to show direct relationships between genotypes. Bacterial isolates often cluster into the same genotype, and dendrograms are not suited to showing these clusters.
The second class of methods produce "network-like" graphs that show direct relationships between clusters of genotypes. Some examples of this second class of methods are found in the works of Zhu et al. [1], Tanaka and Francis [2] and Excoffier and Smouse [3]. The genoclouds in Zhu et al. [1] consist of related isolates of Neisseria meningitidis that are grouped according to criteria that minimize genetic, temporal and physical distances. The result is a parsimonious tree that depicts the relationships between the genoclouds. Similarly, Tanaka and Francis [2] proposed cluster-graphs where isolates of Mycobacterium tuberculosis sharing the same genotype are assigned into clusters, and all possible close relationships between these clusters are shown. In both these methods, clusters are associated with epidemiologically linked cases of infection. Guernier et al. [4] developed a technique of representation that is based on the cluster-graph, with two additional elements included: (1) concentric circles to show the number of possible mutation steps between spoligotypes, and (2) hypothetical intermediate nodes to visualize possible links between clades of spoligotypes known to be related. Excoffier and Smouse [3] used an analysis of molecular variance to construct minimum spanning trees and networks to represent genetic relatedness. See Posada et al. [5] for a more general discussion of graphical methods to represent relationships.
The eBurst package [6,7] is designed for visualizing data from multi-locus sequence typing (MLST). Isolates that have similar sequence types are assigned to disjoint groups, where similarity depends on the number of shared alleles in the MLST profile. The radial layout of eBurst diagrams shows groups or subgroups of related genotypes, centered around the inferred founding genotypes. The complexity of an eBurst diagram suggests the age of the clonal complex; a clonal complex is considered young when its structure is simple and older when its structure is complex. There are many genotyping technologies enabling the study of genetic variation in bacteria.
Here, we focus on spacer oligonucleotide typing (spoligotyping), a technique that exploits polymorphism in the direct repeat (DR) region of M. tuberculosis [8,9]. This method has gained widespread use for differentiating isolates of M. tuberculosis over the last decade [10]. The DR region is composed of numerous identical 36-base-pair direct repeats, interspersed by nonrepetitive short sequences or direct variable repeats (DVRs) called spacers. Mechanisms known to cause variation in this locus are homologous recombination between adjacent or spatially distant DRs leading to deletion, and transposition and recombination of IS6110 elements in the DR locus [9,11].
The DR locus has been identified as a hotspot for the integration of insertion elements in the chromosome of M. tuberculosis complex strains [12]. Such insertion into a spacer sequence can lead to the apparent deletion of that spacer [13]. It is presumed that spacers cannot be recovered when lost, since there is little or no recombination observed between strains [14].
It is possible that there is a relationship between the presence of some number of spacers in specific positions and the transmission rate of a strain, as seen by the lack of a length of spacers in the W-Beijing strain, which is prevalent in many data sets. However, in our model, we assume that a deletion event has no relation with the transmission rate, and any such relationship is beyond the scope of our study. We refer the reader to papers that discuss the importance of studying the W-Beijing type and its transmission [15,16].
In this paper we examine the problem of determining a plausible evolutionary history of a sample of tuberculosis spoligotypes using an explicit model of the evolution of the DR locus. We start with the cluster-graph construct of Tanaka and Francis [2] to represent all possible mutation events in a sample of spoligotypes. Nodes of a clustergraph represent distinct spoligotypes in a sample, and edges drawn between nodes determine the possible mutation events. By mutation we mean a deletion of one or more spacers in a spoligotype. Even for moderately large samples, this can lead to a tangled network of relationships between spoligotypes, which can hinder further analysis. In particular, many spoligotypes appear to have originated from multiple parent spoligotypes. One solution to this problem is to randomly sample edges from a set of multiple inbound edges [17]. However, some edges (mutation events) may be more likely than others to explain the origin of a given spoligotype. We formulate a variety of models to describe the deletion processes that generate variation in the DR locus, and identify an appropriate model using Akaike's Information Criterion. The selected model can then be used to choose a single inbound edge into a specific spoligotype. Applying this procedure to each spoligotype with multiple inbound edges in a sample, we can refine the cluster-graph. We call the resulting graph a spoligoforest.

Methods
We present several candidate deletion models of spoligotype evolution, then compare them using a second-order form of the Akaike's Information Criterion (AIC c ) and data from selected published spoligotype samples (see Table 1). In this section we begin with the underlying assumptions about spoligotypes and their evolution. We then outline the procedure for model selection and finally describe the models.

Assumptions on the evolution of spoligotypes
A spoligotype consists of 43 binary characters. Each binary character denotes the presence or absence of a spacer in the DR locus of M. tuberculosis. The copy number of a spoligotype refers to the number of spacers present in its binary pattern. It is assumed that a mutation event involves the deletion of any number of adjacent spacers from the spoligotype; deleted spacers are not recovered, so that the spoligotype resulting from a mutation always contains fewer spacers than the parent. For our purposes we regard the different mechanisms that influence variation in the DR locus to be indistinguishable. Furthermore, deletion is the only source of variation. In our model we assume that a deletion event has no relation with the transmission rate. We assume that the mutation rate is low enough relative to the transmission rate that infected individuals carry only a single strain of M. tuberculosis with a specific spoligotype. When this infecting strain mutates, it is replaced by a strain with a different spoligotype that has not been observed elsewhere in the sample. Consequently, in any sample, a given spoligotype can have at most one possible parent spoligotype, but potentially many descendants.

Data sets and cluster-graphs
Given that spoligotypes mutate by deletions of adjacent blocks of spacers, we would like to know whether some lengths of deleted adjacent blocks are more probable than others. Specifically, we would like to find the frequency of each deletion size. There are many published tuberculosis data sets using spoligotyping as a marker containing the information required for our purposes. We selected fifteen published data sets that provide the spoligotyping pattern of each M. tuberculosis isolate in the sample and the number of isolates that cluster into each pattern (see Table  1). We consider that individuals within a sample are sufficiently close to each other for transmission to occur. These data sets come from various parts of the world, and vary in statistical features such as RTI n-1 (in the range (0.3279,0.8055)), number of singletons (7,105), average cluster size (1.8,6.7) and θ-estimate (2.73,66.25). Some of these quantities are discussed in [2] and [10].
We use cluster-graphs as described in Tanaka and Francis [2]. We group isolates that have the same spoligotype into clusters; each cluster is drawn as a node, and a possible single-event deletion that relates two clusters (spoligotypes) is represented by a directed edge. Possible deletion events are established by pairwise comparisons of spoligotype patterns. We define a spoligoforest to be a clustergraph in which a single parent is chosen for each cluster having one or more parents. Some clusters in a clustergraph already have a unique parent, and are likely to represent true deletions. This set of unambiguous deletion events forms the sample of deletion lengths for the model selection. Table 2 (column 4) shows the number of such edges from each data set. We assume that mutations occur independently of the state of the population, and hence edges, which represent mutations, are independent. The edges from the different data sets, representing independent deletion events, are pooled together in order to analyze the frequency of deletion lengths. We obtain an empirical distribution of deletion lengths represented by the unambiguous edges from the fifteen data sets. The total pool of analyzed unambiguous edges consisted of 339 deletion events.

Candidate models for spacer deletion length in pattern mutations
Our goal is to find a model that best describes the underlying process generating the distribution of lengths of spacer deletions in the inferred mutation events. We formulated several candidate models based on standard discrete distributions and various hypotheses on spacer deletion lengths. For each model we found the maximum likelihood estimators (MLEs) of the parameters, analytically when possible, and numerically otherwise. Let the observed number of deletions of length i be x i , where i can take values from 1 to 43, let m be the total number of mutations analyzed (the unambiguous edges, in this case 339), and let be the mean deletion length. Let the random variable K describe the deletion length associated with a mutation event. In each of the candidate models, let P(K = k) (or P(k)) be the probability mass function. The   Pooled edges 339 a Number of single-event deletions in the cluster-graph. b Number of single-event deletions in any spoligoforest derived from the data. c Number of nodes in the cluster-graph having a single parent node.
where ρ > 1 Estimated numerically , where p is the vector of parameters, x is the frequency of deletion lengths collected from the data sets, k is the deletion length, and x k is the frequency of the class of deletions with length k.
For each of the models, we computed the value of the second-order variant of Akaike's Information Criterion (AIC c ) [18] to select a parsimonious model. The AIC c is given by the formula where is the likelihood, is the vector of parameters at the maximum likelihood, x is the frequency of deletion lengths collected from the data sets, m is the sample size (the number of edges) and s is the number of parameters in the model. Models with low relative AIC c values are favoured. A summary of the probability mass functions and (MLEs) can be found in Table 3.

Geometric model
Consider a deletion mechanism that moves along the DR region, making independent attempts to remove a spacer. Let k be the number of spacers this mechanism is able to remove, and the constant probability of the removal of a spacer is p. The probability mass function is P(k) = p k-1 (1p), where k ≥ 1 and 0 <p < 1. The MLE for p is = 1 -1/ .

Negative binomial model
To generalize the geometric model, we define the negative binomial parameters p and r. Consider a deletion mechanism that involves r rounds of spacer deletion, so that in total k spacers are removed. Each of these r rounds removes a geometrically distributed number of spacers with parameter p. The probability mass function is where k, r ≥ 1 and 0 <p < 1. The MLEs were found by solving = 1 -/ for and . This was done by considering the equations conditionally on 1 ≤ ≤ 10 to solve for the values of and inspecting the likelihood.

Conditional Poisson model
In this model, the deletion process results in the loss of k spacers distributed as a Poisson parameter λ > 0, conditional on k ≥ 1. The probability mass function is with MLE such that .

Logarithmic model
Suppose the deletion process causes the loss of k spacers, following a logarithmic distribution, given by the probability mass function with 0 <θ < 1 and k ≥ 1. The MLE is found by solving the equation for .

Zeta model
Suppose the deletion process results in the loss of k spacers distributed as a zeta parameter ρ. The probability mass function is with ρ > 1 and k ≥ 1. The MLE can be calculated numerically.

Zipf model
If we restrict the loss of k spacers to the interval 1 ≤ k ≤ 43, then the probability mass function for the zeta model can be written with a finite sum in its denominator, i.e., The MLE can be found numerically.

Uniform model
We consider a deletion process that cuts spacers in lengths distributed uniformly across deletion lengths k up to some endpoint a. The probability mass function is P(k) = 1/a where k and a are integers such that 1 ≤ k ≤ a and 1 ≤ a ≤ 43. We denote this model as Uniform (V), where the endpoint a is a parameter, and is allowed to vary. The MLE was found numerically. We also included a variation of p a 43, with MLE . We refer to this model as Uniform (F), where the endpoint a is fixed at a = 43.

Empirical model
We include in the analysis a model that completely represents the empirical values of deletion frequencies in the fifteen data sets used as reference (see Table 1). The probability mass function is where x k is the frequency of k-deletions and m is the total number of deletions, with 1 ≤ k ≤ 43. Setting the parameters to be p k for each k, the values of the MLEs are found analytically.
This model has 42 free parameters.

Results
We begin with some general observations about the relative frequencies of different deletion lengths. We discuss the outcomes of the model selection procedure, and then apply the chosen model to a new visualization method for representing relationships of isolates of tuberculosis.

Inferred pattern mutations of spoligotypes frequently involve short spacer deletions
The selected data sets independently show a high frequency of deletions of a single spacer. The pooled set of edges are shown in the gray bars of Figure 1. The 339 single-event deletions are distributed as shown in Table 4. Deletion lengths not appearing in Table 4 are not observed. The average number of spacers deleted is = 2.46 and the standard deviation is 3.376. The skewed distribution of deletion sizes indicate a high number of short deletions, and very few longer deletions. Note, it is conceivable that spoligotypes that exist in the population but not sampled are intermediate in state between two sampled spoligotypes. If such spoligotypes exist and are sampled then the distribution would shift further towards shorter deletion lengths.

Model selection
We selected the model with the lowest value of AIC c (see Equation 1). This model is the Zipf model (Equation (6)). AIC c values and parameter estimates for some of the models are shown in Table 5. Figure 1 shows a plot of the relative frequencies for deletion lengths estimated by each of the models as well as the actual empirical values (gray bars). We verified that the selected model fits the individual data sets well by repeating the analysis separately for the individual data sets. The Zipf model often has the lowest AIC c , while the logarithmic series and geometric models are selected in some individual data sets (see Table 6).

Visualizing relationships among spoligotypes
The selected model can now be used in a method to visualize relationships among M. tuberculosis isolates. For a specific data set consisting of M. tuberculosis isolates typed using spoligotyping, we represent each spoligotype by a node with area proportional to the number of isolates with that spoligotype pattern. Inferred possible mutation events are represented by directed edges between nodes, with the arrowheads pointing to descendant spoligotypes. This specifies the cluster-graph [2].
Multiple inbound edges into a node are reduced to a single inbound edge. We use a heuristic that chooses a single inbound edge that has maximum weight. We define the weight w of an edge e AB in a cluster-graph from spoligotype A to its descendant B to be: where P(·) is the model of deletion, d(·) is the deletion length represented by the edge, n i is the cluster size of spoligotype i, and S is the set of nodes that are possible parent nodes of spoligotype B. Ties in the maximum weight are broken arbitrarily. The resulting graph is what we refer to as a spoligoforest. Code for automatically constructing cluster-graphs and spoligoforests from a sample of tuberculosis spoligotypes was implemented using the visualization software GraphViz [19]. The method has been implemented on a web server and is publicly available at http://emi.unsw.edu.au/spolTools (see [20] for details).

Application of the method to tuberculosis spoligotypes
We applied our new method for constructing forests to several data sets. To illustrate the method, we first use published data from a study on the transmission of tuberculosis in Cuba [21]. Isolates collected over a year were typed using both spoligotyping and IS6110 typing. Onehundred and fifty-seven isolates were classified into 47 spoligotype patterns. The clusters of isolates sharing the same spoligotype are nodes in the diagram, labelled using shared type (ST) numbers in SpolDB4 [22] wherever possible. When the spoligotype is not found in SpolDB4, we labelled it as 'Or' with a number (e.g. Or1). Orphan spoligotypes are unique alleles without an ST number [23]. Following the description in Tanaka and Francis [2], we constructed the cluster-graph for these data in a hierarchical layout as shown in Figure 2, with edges labelled with the weights computed using our selected model. The size of each node reflects the number of isolates in that node. The resulting graph is a complex network showing all possible relationships of spoligotypes under our assumptions about the spoligotype mutation processes. The Zipf model is used to calculate the weights of the edges, as given in Equation (7), of the cluster-graph. In this cluster-graph, there are 19 nodes with multiple inbound edges. The nodes are labelled according to the shared type (ST) identifiers used in SpolDB4 [22]. For example, ST 718 has 18 possible parents, while STs 47, 1, 62, 791, 2, 132 and 209, each has 3 possible parents. Of the 83 edges in the clustergraph, 37 were chosen for the spoligoforest (see Figure 3). As with the cluster-graph, the nodes in the spoligoforest represent the number of isolates that share the same spoligotype pattern. If the weight of the edge is equal to 1, we draw a solid edge, if the weight is greater than or equal to 0.5 but less than 1, a dashed line, and if less than 0.5, a dotted line.   a disconnected node representing 6 isolates, is also of the Haarlem family. The separate smaller tree on the right includes ST 1 with 20 isolates. This is the spoligotype of the W-Beijing strain, known to be widely distributed around the world.

Comparative analyses
In this section, we compare the spoligoforest to two other methods of visualisation, namely phylogenies and cladograms. We illustrate that using models with AIC c values close to that of the Zipf model has minimal effect on the edges of a spoligoforest.
The branches in a phylogeny show indirect relationships between isolates via implicit common ancestors, whereas the edges in the spoligoforest describe direct relationships among clusters of spoligotypes.
However, related spoligotypes in the spoligoforest are consistent with inferences on clustered isolates from a phylogeny. Figure 4 shows a phylogenetic tree based on IS6110-typing and Figure 5 is a spoligoforest using data from a prison in Azerbaijan [24]. The tree depicts genetic relatedness of isolates with each other based on similarities of IS6110 banding. The leaves of this tree have been renamed using STs (shared types from SpolDB4) of the spoligotypes, so that isolates sharing the same spoligotype may appear in different leaves of the tree. An inspection of the branch lengths in the phylogenetic tree indicate that ST 42 is most related to ST 254 (2 isolates of differing IS6110 bands.). The spoligoforest in Figure 5  A similar network-like technique of visualization to the spoligoforest is the cladogram in Figure 6 of [25]. The method of construction of the cladogram involves using information from nested clades and geographic location. The main difference between the cladogram and the spoligoforest is that the cladogram involves the introduction of intermediate steps between types, accounting for possibly unsampled spoligotypes. The spoligoforest for this data set is shown in Figure 6. The LAM3 and LAM9 groups identified in the cladogram are also evident as a subtree in the spoligoforest, with ST 42 at the top of this subtree (see highlighted region in Figure 6).   Length  1  2  3  4  5  6  9  10  11  12  13  14  15  16  17  26  32 F r e q u e n c y 1 9 9 6 1 2 1 2 5 4 9 3 2 5 1 1 2 2 1 1 1 1 In order to assess the choice of model among the best four, we applied the method to several data sets using a range of possible models. This procedure has revealed that model selection has minimal impact on the edges of spoligoforests. We constructed the spoligoforests for six data sets, using the zeta, logarithmic series, geometric and empirical models. Table 7 shows the number of differing edges in spoligoforests constructed from these alternative models, relative to that constructed using the selected model (Zipf). Clearly, the Zipf and zeta models are similar, as the only difference between them is that the domain of the Zipf distribution is finite (see Table 3). The spoligoforest for the data set from Madagascar [26] using the selected Zipf model is shown in Figure 7. The spoligoforest using a logarithmic series model (Figure 8) for the same data set differs from Figure 7 by 4 edges, the highest number of edge differences among the data sets and models we tested.

Discussion
This paper proposes a new method of visualizing the relationships among genotypes of tuberculosis by selecting a Cluster-graph of Cuban data in Diaz et al The spoligoforest generated from the Cuban data in Diaz et al Figure 3 The spoligoforest generated from the Cuban data in Diaz et al.
[21] Edges with weights less than 0.5 are drawn as dotted lines, those with weights greater than 0.5 but less than 1 are dashed, and those where no decision was required to be made are solid. For example, ST 1 is resolved to have mutated from Or1, and is drawn as a dashed line because it has a weight equal to 0.5582. As in the cluster-graph, the lengths of edges do not represent evolutionary distance. In this website, users can search through the repository of spoligotype data sets in spolTools as well as manipulate their own data sets. These data sets can be processed to construct spoligoforests.
A spoligoforest recovers a plausible history of transmission and mutation events. The area of each node is proportional to the number of isolates (cluster size); edges between nodes reflect evolutionary relationships between spoligotypes with arrowheads pointing to descendants. A single edge is chosen from multiple inbound edges using the deletion model, resulting in a forest -that is, a collection of acyclic graphs, or trees.
Information about the age of a spoligotype is contained in three aspects of a spoligoforest. First, if its node is large, the strain with that spoligotype may have been transmitted extensively over a long time. Second, a large number of descendants (outbound edges) suggests the strain has had a long period over which to generate mutations. Third, the location of a node also offers clues as to age: the closer it is to a root node, the older it is. For example, ST 1 in Figure 3 is a root and potentially the oldest spoligotype in this forest.
If a spoligotype node size is large yet located at a tip of the spoligoforest, this mixed signal may indicate that the strain with the spoligotype is transmitting faster than the other strains in the data set [17]. Or2 1 ( 46 ) this case, therefore, there is no evidence for the presence of emerging strains. Note that the algorithm for choosing edges proposed in this study could be used to refine the method of Tanaka and Francis [17].
One way to improve the analysis of strain age and emergence would be to consider spoligotypes in conjunction with other molecular markers. For example, consider the same two spoligotypes discussed above (ST 42 and ST 81 Spoligoforest of Caribbean data in Duchene et al Figure 6 Spoligoforest of Caribbean data in Duchene et al. [25]. Spoligoforest for the Caribbean data set (from Cuba, Haiti and French Antilles) in [25]. The clustering of spoligotypes from the LAM group appears in the subtree with ST 42 at the top. This group is highlighted in the spoligoforest, and corresponds to clade 2-1 discussed in [25], which includes the LAM3 and LAM9 families.  Figure 8.
from Figure 3). ST     may occasionally occur, but it is likely to be infrequent. The occurrence of homoplasy may have only a minor effect on graph-construction, producing a small number of cycles if such events could be properly identified. Second, we always choose one edge (parent) among possible inbound edges into a given spoligotype. It is conceivable, however, that the given spoligotype did not descend from any of the candidate parents. An improvement to the method would incorporate a procedure for not choosing any edges when appropriate. Third, as in any statistical analysis involving samples of data, there could be a bias in sampling. An overrepresentation of a spoligotype in a sampled data set can lead to biased selection of a parent node. Fourth, our methodology cannot be applied to markers such as Variable Numbers of Tandem Repeats (VNTR), which is commonly used to type various bacteria. The mutation process for VNTRs is better modeled using a stepwise mutation model rather than a deletion model.

Spoligoforest of Madagascar data in Ferdinand et al
Our method may, however, be suitable for markers based on loci of similar structure in some other bacteria. The direct repeat region of M. tuberculosis is among a family of repetitive genome sequences that are called Clustered Regularly Interspersed Short Palindromic Repeats (CRISPRs) found in many different species of bacteria and archaea [27][28][29]. Recently, CRISPR systems have received increased attention due to evidence that links these loci with the acquisition of resistance in bacteria to infection by phages [ The mechanisms that are believed to be involved in the evolution of CRISPR systems involve a frequent deletion of spacer-repeat motifs (thought to be necessary to prevent over-inflation of the CRISPR locus [29]) as well as the insertion of new spacers next to the leader sequence due to uptake of phage DNA [33].
Typing methods similar to spoligotyping for other bacterial isolates with CRISPRs are being developed. One such typing method (also called spoligotyping) has been applied to Corynebacterium diphtheriae strains, in which the location and structure of two CRISPR loci have been identified [34]. These loci consist of 27 spacers (the DRA with 21 spacers and the DRB with 6 spacers) in two different regions of the genome. The spoligotyping method used in this particular study is similar to the method used for M. tuberculosis. At present, there is yet to be an analysis of the evolution of these DR loci in C. diphtheriae. It has also been speculated that in some CRISPRs, new repeat motifs can appear, like those in Yersinia pestis [33]. Investigations into how these loci evolve may allow the development of methods similar to that described here.
As with other visualization methods, the groupings and relationships that are seen in the spoligoforest can be analysed along with the known clinical features of strains. Such analyses are valuable when an understanding of the history of transmission and mutation of strains is required.

Conclusion
There is a lack of tools for visualizing relationships among tuberculosis isolates that employ a model describing evolution of a specific marker. Current understanding of the evolution of spoligotypes led us to a method for visualizing relationships of isolates within a sample. The methodology presented in this paper may be applied to loci that have the same structure as the DR region of Mycobacterium tuberculosis, and whose evolution involves the deletion of spacer-repeat motifs. The groupings and relationships that are seen in the spoligoforest can be analysed along with the clinical features of strains to understand the evolution of strains.