Using affinity propagation for identifying subspecies among clonal organisms: lessons from M. tuberculosis
 Claudio Borile^{1, 2},
 Mathieu Labarre^{1},
 Silvio Franz^{1},
 Christophe Sola^{3, 4} and
 Guislaine Refrégier^{3}Email author
DOI: 10.1186/1471210512224
© Borile et al; licensee BioMed Central Ltd. 2011
Received: 18 January 2011
Accepted: 2 June 2011
Published: 2 June 2011
Abstract
Background
Classification and naming is a key step in the analysis, understanding and adequate management of living organisms. However, where to set limits between groups can be puzzling especially in clonal organisms. Within the Mycobacterium tuberculosis complex (MTC), the etiological agent of tuberculosis (TB), experts have first identified several groups according to their pattern at repetitive sequences, especially at the CRISPR locus (spoligotyping), and to their epidemiological relevance. Most groups such as "Beijing" found good support when tested with other loci. However, other groups such as T family and T1 subfamily (belonging to the "EuroAmerican" lineage) correspond to nonmonophyletic groups and still need to be refined. Here, we propose to use a method called Affinity Propagation that has been successfully used in image categorization to identify relevant patterns at the CRISPR locus in MTC.
Results
To adequately infer the relative divergence time between strains, we used a distance method inspired by the recent evolutionary model by Reyes et al. We first confirm that this method performs better than the Jaccard index commonly used to compare spoligotype patterns. Second, we document the support of each spoligotype family among the previous classification using affinity propagation on the international spoligotyping database SpolDB4. This allowed us to propose a consensus assignation for all SpolDB4 spoligotypes. Third, we propose new signatures to subclassify the T family.
Conclusion
Altogether, this study shows how the new clustering algorithm Affinity Propagation can help building or refining clonal organims classifications. It also describes wellsupported families and subfamilies among M. tuberculosis complex, especially inside the modern "EuroAmerican" lineage.
Keywords
asexual organisms species delineation epidemiology DR locus Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR)Background
The advent of powerful genotyping methods, either by global sequencing or by highthroughput analysis of variation at specific loci (mini or microsatellites [1]; CRISPR (Clustered Regularly Interspaced Short Palindromic Repeats) loci [2, 3]; SNPs [4]), provides masses of genetic data that need to be compared and clustered. Most widely used comparison methods are phylogenetic methods i.e. hierarchical clustering, building treelike structures to display the diversity. These methods consider that each individual forms a cluster and repeatedly merge the most similar clusters based on pairwise distances (Phenetics such as NeighbourJoining), or try to infer the tree that most fits the data (Cladistics such as Maximum Likelihood, Bayesian analysis) using an appropriate evolutionary model of the compared characters. This provides a continuous pattern of how divergent organisms are. Other comparison methods consist in finding relevant clusters, a process referred to as partitioning. A method made popular by the software Structure [5], and referred to as modelbased clustering, consists in using Bayesian methods to assign individuals in a predetermined number of populations. The assumption underlying this software is that the population conforms HardyWeinberg hypotheses i.e. refers to organisms reproducing sexually, with random pairing inside the population. This assumption is theoretically very problematic for clonal organisms, although practice has shown that it can provide meaningful results [6], partly because some parameters can be set to mimic poor mixture inside populations. Other methods have been developed outside biology, for instance to categorize images [7, 8]. They use similarity to group data in spherical clusters well represented by their centroid (also called representative or exemplars), and have already been tentatively used to classify human genetic data [9]. This method awaits further experimental validation on large datasets.
Clustering methods can be applied to different types of loci, ranging from repetitive sequences such as insertion sequences, micro, minisatellites or the CRISPR loci to single nucleotide polymorphisms (SNPs), provided an appropriate method is available to calculate the distance between individuals. Such methods usually rely on a model of the mutation process. Which loci should be targeted depends on the mean divergence time between individuals, as repetitive sequences mutate faster than SNP loci. Several mutation models have been developed for DNA sequence with point mutations [10]. For repetitive sequences (micro and mini satellites), categorical distance or the Stepwise Mutation Model (SMM [11]) are mostly used.
CRISPR loci (Clustered Regularly Interspaced Short Palindromic Repeats) form a new family of repetitive sequences [12, 13]. They consist in the repetition of 24 to 47 bp sequences called Direct Repeats (DR) separated by unique sequences called spacers (from 26 to 72 bp). The constitutive unit is therefore the combination of one DR and one spacer, and presently described CRISPR loci have from 2 to 249 units [13]. These repetitions are surrounded by proteinencoding sequences called cas genes (derived from C RISPRas sociated genes). The whole locus confers resistance to bacteriophages and plasmids in Streptococcus thermophilus[14, 15] and in Escherichia coli when overexpressed [16]. Resistance to the corresponding organisms is under investigation in species where spacers are homologous to foreign DNA [17]. They exhibit a quite high mutation rate so that they have proven useful for epidemiological studies. Describing the presence or absence of 43 spacers of M. tuberculosis CRISPR locus has become a routine technique in any tuberculosis reference center and is referred to as spoligotyping for sp acer oligo nucleotide typing[18]. Pairwise comparisons of binary profiles can provide a distance matrix that has been used to infer phylogenetic relationships. The most common approach to infer relationships using spoligotype patterns uses the Jaccard index (same principle as Hamming distance or Dice coefficient) as distance [19], counting the proportion of spacers that are present in both profiles. The distance matrix, either made of pure spoligotyping data or combining them with minisatellite data, is usually processed using UPGMA or NJ algorithm to build a dendrogram or a phylogenetic tree [20]. A more elaborate approach using the Zipf distribution and the evolutionary dynamics of CRISPR loci has proven more relevant to infer phylogenetic relationships for the M. tuberculosis complex [21] but is not implemented in a userfriendly software yet and does not propose assignations for all currently described spoligotype patterns.
The worldwide database of spoligotyping in M. tuberculosis complex is called SpolDB (the latest public version being SpolDB4), and has helped identifying recurrent signatures in CRISPR profiles [22–24]. These signatures, mainly based on the absence of adjacent spacers, led to the naming of large clonal families, the monophyly of which has been confirmed through other markers such as minisatellites (referred to as MIRUVNTR for MycobacterialInterspersedRepetitiveUnitsVariable Number of Tandem Repeats), Region of Deletions (RDs) and SNPs [6, 25, 26]. Main acknowledged families are EAI for EastAfricanIndian (later referred to as "IndoOceanic" by Gagneux et al.), M. africanum 1 and 2 (later "Westafrican 2" and "WestAfrican 1"), animal strains (M. bovis, M. caprae, M. pinnipedii, M. microtii), CAS for "CentralAsia" (later "EastAfricanIndian"), Beijing, X, Haarlem, LAM for "LatinoAmerican and Mediterranean", T and MANU (also designated as T ancestor) [23, 27]. Monophyly of each of the LAM, T and Haarlem families has been partly invalidated. However, the larger lineage to which they belong and that is characterized by the 3336 spacers deletion at the CRISPR profile (merging T, LAM, X, Haarlem families and S subfamily) has been confirmed and designated as the "EuroAmerican" lineage [27]. It corresponds to the Principal Genetic Groups (PGG) 2 and 3 as defined by Sreevatsan et al.[28]. Altogether deletions in spoligotype patterns have proven to contain phylogenetic information and allow most strains be assigned to the families described above. Assignations performed by experts are reported in SpolDB4 database, patterns carrying no or contradictory signatures been labeled as "U" for "Unknown or Unclassified". To circumvent the dependence on experts' analysis, the Bennett's laboratory proposed automatized classification of spoligotype patterns using Bayesian algorithms and a distance method taking into account the deletion process by which spoligotype patterns evolve. They provide an online tool called Spotclust [29] to assign each spoligotype to a family, either one described in SpolDB3 [30] or one of the 6 large families proposed by Gagneux and coworkers [31].
Here we wanted to take advantage of a recently developed algorithm, Affinity Propagation, to confirm and extend these methods. This algorithm identifies references for every data point so that data are grouped and centered on these references while a specific cost function is minimized. The cost of adding a new reference point, assigned by the user, determines the final number of clusters. Prior to the use of this algorithm, we tested different distances to calculate pairwise distances between spoligotype patterns. We took advantage of previously identified references and expert assignation to rank these distances, some of which are derived from previously proposed evolutionary models [21, 31]. The distance that best identified the appropriate reference for each spoligotype pattern was implemented in the Affinity Propagation algorithm to identify relevant subfamilies among M. tuberculosis complex (MTC). These families partly correlated with previously identified subfamilies.
Altogether, this approach allowed us to assess the robustness of previously identified sublineages among MTC, to identify new relevant sublineages and to provide reassignations of the spoligotype patterns described in SpolDB4. These reassignations interestingly matched those of studies using VNTR and/or SNP data.
Results
Comparison of classifications based on new distances or on Jaccard index to expert classification of SpolDB4
References of the ten best acknowledged M. tuberculosis complex families
SIT  SpolDB4 classification  Reference Spoligotype pattern  

family  subfamily  
1  BEIJ  BEIJ 

26  CAS  CAS1 

42  LAM  LAM9 

50  H  H3 

53  T  T1 

100  MANU  MANU1 

119  X  X1 

236  EAI  EAI5 

181  AFRI  AFRI1 

482  animal  BOV1 

"Deletions" method succeeds in correcting false SpolDB4 assignations
Assignations of LAM7, H4 and selected "U" spoligotype patterns from SpolDB4, according to different methods.
spoligotype pattern  SpolDB4  Recent litterature  Deletions  Domain Walls  SpotClust subfamily  

SIT  family  Subfamily  assignation  family  family  SpolDB3based  RIM  
41 
 LAM  LAM7  TTUR  T  LAM  LAM9  N19 
186 
 LAM  LAM7  TTUR  T  LAM  LAM9  N19 
367 
 LAM  LAM7  TTUR  T  LAM  LAM9  N19 
930 
 LAM  LAM7  TTUR  T  LAM  LAM1  N19 
1261 
 LAM  LAM7  TTUR  T  LAM  LAM9  N19 
1589 
 LAM  LAM7  TTUR  T  LAM  LAM3  N2 
1924 
 LAM  LAM7  TTUR  T  LAM  LAM9  N19 
1937 
 LAM  LAM7  TTUR  T  LAM  LAM9  N19 
35 
 H  H4  TUral  T  H  H3  N34 
262 
 H  H4  TUral  T  H  H3  N34 
361 
 H  H4  TUral  T  H  H3  N34 
399 
 H  H4  TUral  T  H  T2  N34 
596 
 H  H4  TUral  T  H  H3  N34 
597 
 H  H4  TUral  T  H  H3  N34 
656 
 H  H4  TUral  T  H  H3  N34 
762 
 H  H4  TUral  T  H  H3  N34 
777 
 H  H4  TUral  T  H  H3  N34 
817 
 H  H4  TUral  T  H  H3  N34 
920 
 H  H4  TUral  T  H  H3  N34 
921 
 H  H4  TUral  T  H  H3  N34 
922 
 H  H4  TUral  T  H  H3  N34 
1117 
 H  H4  TUral  T  H  H3  N34 
1134 
 H  H4  TUral  T  H  H3  N34 
1165 
 H  H4  TUral  T  H  H3  N34 
1174 
 H  H4  TUral  T  H  H3  N34 
1242 
 H  H4  TUral  T  H  U  N34 
1269 
 H  H4  TUral  T  H  H3  N34 
1276 
 H  H4  TUral  T  H  H3  N34 
1281 
 H  H4  TUral  T  H  H3  N34 
1292 
 H  H4  TUral  T  H  H3  N34 
1447 
 H  H4  TUral  T  H  H3  N34 
1448 
 H  H4  TUral  T  H  H3  N34 
1457 
 H  H4  TUral  T  H  H3  N34 
1568 
 H  H4  TUral  T  H  H3  N34 
1581 
 H  H4  TUral  T  H  H3  N34 
1384 
 H  H4  TUral  T  U  T3  N40 
1446 
 H  H4  TUral  T  U  H3  N34 
1452 
 H  H4  TUral  T  U  H3  N34 
1455 
 H  H4  TUral  T  U  U  N34 
1456 
 H  H4  TUral  T  U  H3  N34 
1461 
 H  H4  TUral  T  U  H3  N34 
1480 
 H  H4  TUral  T  U  LAM9  N19 
105 
 U  U  H  U  Afri  LAM7  N29 
1274 
 U  U  LAM  U  Afri  H1  N5 
1531 
 U  U  X  X  X  X1  N44 
Interestingly, Beijing, X and EAI families exhibited no incongruence between the "Deletions" and the expert method (no light gray box), suggesting that these families are clearly and appropriately defined. As reported above (Figure 2), the Jaccard method failed to assign most spoligotype patterns to any family; for instance, no spoligotype patterns could be assigned to BEIJ, EAI or X families (Figure 3A) with a maximum similarity to any reference not reaching 20% for BEIJ family (Additional file 1). "Domain Walls" and "Blocks" methods provided either poor resolution between correctly and wrongly assigned spoligotype patterns (Figure 3A and 3C), or a lower number of families with no discrepancy with the expert classification (only the X family with the "Domain Walls" method, Figure 3B).
Assignations of U spoligotype patterns
Automatized identification of references by Affinity Propagation clustering
The "Deletions" method is highly useful to classify spoligotype patterns in the described families, but this classification highly depends on the identification of references. These references are widely acknowledged for major families but the relevance of finer classification is recurrently debated [25, 27, 35]. Affinity propagation is an algorithm that identifies a representative (also called exemplar) for each data point in an iterative manner until the chosen configuration of exemplars minimizes a suitable cost function that depends on the choice of the clusters (see Methods). A parameter set by the user (that we denote as 'penalty', p) determines an additional cost for every exemplar found. When p is low (high negative value), large clusters are built where some data points have relatively low similarity with their representative. As p increases, the clusters reach smaller sizes so that they become
References after Affinity Propagation clustering for n_{clusters} = 12.
APfamily  Reference  Majoritary SpolDB4 family  

SpolDB4 subfamily  SIT  Spoligotype pattern  Family  Proportion in the APfamily  Total Nb  
animal1(Bov13CapMicPin)  bovis_1  482 
 Animal  0.888  206 
animal2(Bov2)  bovis_2  683 
 Animal  0.621  66 
Beijafri  BEIJ  255 
 Afri  0.339  56 
CAS  CAS_1  26 
 CAS  0.760  96 
EAI  EAI_5  236 
 EAI  0.84  250 
H12  H_1  47 
 H  0.853  68 
H 3  H_3  50 
 H  0.874  111 
LAM _{(931164)}  LAM_9  42 
 LAM  0.721  179 
T2  T_2  52 
 T  0.545  145 
T3LAM_{(25)}  LAM_2  17 
 LAM  0.432  148 
T(UralH3LAM_{ 107) })  T_1  53 
 T  0.823  351 
S(&U)  S  34 
 T  0.554  74 
T(&U)  T_1  173 
 T  0.420  81 
X  X_1  119 
 X  0.75  108 
References after Affinity Propagation clustering for n_{clusters} = 32.
APsubfamily naming  reference  Most represented subfamilies  Nb of spoligotype patterns  

Classical subfamily naming  SIT  spoligotype (43 format)  First most represented subfamily  Second most repr. family  
Subfamily  Prop.  
Afri1  AFRI1  181 
 AFRI1  0.531  AFRI  32 
Afri23  AFRI2  331 
 AFRI2  0.364  AFRI3  22 
Beij  BEIJ  1 
 BEIJ  0.842  U  19 
Bov13  BOV1  482 
 BOV1  0.585  BOV  159 
Bov2  BOV2  683 
 BOV2  0.467  BOV  45 
Cap  CAP  647 
 CAP  0.75  U  20 
CAS  CAS1  26 
 CAS1  0.487  CAS  80 
EAI1  EAI1  48 
 EAI1  0.804  U  46 
EAI35 (del23373839)  EAI2  11 
 EAI5  0.383  EAI3  55 
EAI2 (del32021)  EAI2  19 
 EAI2  0.5  U  48 
EAI  EAI5  236 
 EAI5  0.651  EAI4  86 
EAI6 (del2337)  EAI6  292 
 EAI6  0.5  EAI5  42 
H12  H1  47 
 H1  0.790  U  62 
H3  H3  50 
 H3  0.927  U  96 
Ural  H4  262 
 H4  0.714  U  28 
LAM521(del313)  LAM2  17 
 LAM5  0.207  U  92 
LAM3  LAM3  33 
 LAM3  0.455  U  33 
LAM  LAM9  42 
 LAM9  0.574  LAM11  136 
Manu  MANU2  54 
 MANU2  0.793  U  29 
PinMic  PIN  637 
 BOV  0.391  U  23 
S  S  34 
 S  0.678  U  59 
T (T1H3Lam10Cam)  T1  53 
 T1  0.828  H3  261 
T1a (del54043)  T1  833 
 T1  0.484  U  31 
T1b (del21)  T1  291 
 U  0.367  T1  30 
T1c (del15)  T2  118 
 T1  0.432  U  37 
T2 (del40)  T2  52 
 T2  0.521  U  119 
T3 (del13)  T3  37 
 T3  0.373  U  59 
T4 (del1923243839)  T4  39 
 T1  0.406  T4  32 
T5 (del23)  T5  44 
 T5  0.561  U  41 
X  X1  119 
 X1  0.492  U  61 
X2  X2  137 
 X2  0.824  T1  34 
SEA1 (del2934)  U  458 
 U  0.955  CAS  22 
Spoligotype patterns clustered with SIT 458 with Affinity Propagation when n_{clusters} = 32.
SIT  Spoligotype pattern  SpolDB4 assignation  Main country 

458* 
 U  THA 
354 
 U  GBR 
526 
 U  GNB 
527 
 U  GNB 
863 
 U  BRA 
1172 
 U  EST 
1186 
 U  THA 
1187 
 U  THA 
1374 
 U  MYS 
1386 
 U  BGD 
1436 
 U  BGD 
1462 
 U  GEO 
1515 
 U  MDG 
1518 
 U  MDG 
1519 
 U  MDG 
1520 
 U  MDG 
1521 
 U  MDG 
1524 
 U  MDG 
710 
 U  NLD 
405 
 U  VNM 
426 
 CAS_2  USA 
523 
 U  MYS 
Discussion
Here we first validated a simple distance method that can be used to classify CRISPR genetic profiles based on a worldwide M. tuberculosis spoligotype database. Second, using a recent clustering algorithm exploiting a different approach with respect to those commonly used in the biological sciences community, we could identify an alternative M. tuberculosis classification. The comparison between the largely validated expert classification described in the international database SpolDB4 and our alternative classification validates our approach for M. tuberculosis CRISPR profiles, opening the way for its use for other bacterial species where CRISPR were shown to provide interesting typing information [16].
Clustering power of CRISPR patterns
M. tuberculosis complex (MTC) has been infecting humans for at least 2600 years [37] and could be 20,000 years old or even much older [6, 38]. Despite its restricted genetic diversity even between human and animal strains [39, 40], phylogenetical relationships have been detected using polymorphic DNA sequences [41, 42]. CRISPR loci characterized using the "spoligotyping" technique have been used to define families through the use of socalled signatures i.e. the absence and presence of characterized units of CRIPSR loci, the spacers [43]. Most of these families found independent support such as host range or congruence with independent genetic markers [22, 23, 44], even SNPs and Regions of Deletion [26, 34, 45]. However, some of them were "illdefined" i.e. had a signature that was shared by several other groups, and others were defeated by independent loci: H4 subfamily was renamed Ural as it was related to T strains and not H strains [34, 36], LAM7 and LAM10 were renamed TUR and Cameroon respectively as they are unrelated to LAM strains [6, 34, 35]. As a consequence, the use of CRISPR patterns to infer phylogenetical relationship was recurrently discussed [44, 46].
We used here an automatized approach for clustering CRISPR patterns. Our clusters largely reproduced the wellacknowledged MTC families and provided meaningful clustering for Ural, TUR and Cameroon. In fact, the misclassification of Ural among Haarlem family was due to the merging of all signatures having spacer 31 deleted and spacer 32 present disregarding the left border of the deletion. This classification criterion is not relevant knowing the evolutionary dynamics of CRISPR loci due either to the insertion of IS6110 elements or to the deletion of one of several adjacent spacers. This kind of errors is avoided if comparison is performed using an algorithm identifying complete signatures (left and right borders of the deletions) as included in our automatized approach (see below for a detailed discussion on methods used to calculate distances between strains).
Still, the fact that some families are "illdefined" is an intrinsic problem of spoligotyping: CRISPR loci in M. tuberculosis are relatively short and they evolve at a rate that cannot exclude the absence or the insufficient number of mutation in some lineages. This intrinsically limits the power of our study, i.e. we cannot classify all strains, and not all of them with the same precision. However, this problem does not affect the assignation quality of the strains we classify which are in fact numerous (more than 80%).
We thus argue that CRISPR profiles evolving by the insertion of transposable elements or by deletion such as those of M. tuberculosis contain relevant information for clustering and even inference of some phylogenetic links. The targeted locus must however not be missing for the individuals to be classified. To avoid this pitfall, the use of CRISPR loci should restrict to recently diverged groups as is the M. tuberculosis species complex (more than 99.9% identity). Such organisms uncover diverse human pathogens such as Yersinia pestis, Salmonella enterica, Bacilllus anthracis, Mycobacterium leprae and Mycobacterium ulcerans. Still, the use of CRISPR profiles in phylogenetic reconstructions would benefit from further developments and validations for species with still active CRISPR loci.
Distance methods for CRISPR profiles
If CRISPR can be used to infer phylogenetic relationship, the evolutionary model or distance method used during the inference is also of great importance. Several developments had been proposed until now. We want to discuss here what our approach adds to previous ones.
CRISPR profiles (spoligotype patterns) form a sequence of binary data. As such, it has been analyzed with tools developed for binary information such as the Jaccard Index that focuses on the sharing of every unit in the profile (here the spacers) taken independently. This however ignores an essential feature of the corresponding CRISPR locus: that it evolves by the loss of spacers. These losses can occur either because of the insertion of a transposable element that disrupts the sequence used in the spoligotyping technique, or by deletion. Deletions can occur for several spacers at once, even if the frequency of large deletions may be lower [21]. As a consequence, the distance between two patterns, one carrying many spacers and the other carrying one large deletion, should not be considered as proportional to the number of spacers that were lost (as done by the Jaccard index), but as corresponding to a single mutation event. The methods proposed by the Bennett laboratory [30, 31] take into account the deletion process and add a probability function that best mimics the frequency of deletion size. In Spotclust, a Bayesian algorithm incorporating the inference of ancestral spoligotype patterns based on SpolDB3 database is used to assign spoligotypes to SpolDB3 subfamilies or to families built using a Randomly Initialized Model (RIM) [30]. We showed here that, by simply using expertdefined references of main families and the "Deletions" distance method that is based on deletion signatures, we could better assign Unknown spoligotype patterns than Spotclust as well as correct previous erroneous assignations in SpolDB4 classification such as those to LAM7(TUR) [29]. For Spotclust algorithm, this was true for both the SpolDB3based classification and the Randomly Initialized Model. The reason for that could be either that the size of the database used by Spotclust was too small to capture evolutionary steps relevant to MTC evolution, or that Bayesian statistical inferences are too dependent on priors.
Performance of the Affinity propagation algorithm on CRISPR profiles clustering
Affinity Propagation is a messagepassing algorithm that considers clustering as a problem of minimizing an "energy" function of the clusters configuration in the data set (see Methods section for a general review of the algorithm, and [8]). This approach seems particularly promising and could help solving species delineations in asexual lineages where obligate gene exchange cannot be used as a delineation criterion [9]. One of the main features of the algorithm is the possibility of regulating the total number of clusters as a function of an input parameter of the algorithm (called the "chemical potential" μ, by analogy to the chemical potential of physical systems, or p for penalty parameter, see Methods). Also the high speed (the computational time goes as N^{2} if N is the size of the dataset) and thus the possibility to analyze very large networks is encouraging the use of this algorithm. With this method we identified both families and subfamilies in MTC. A single family out of 14 made no sense (Beijingafricanum). This is likely due to a lack of information in Beijing spoligotype pattern as the large 136 deletion limits the recognition of other signatures. When considering patterns carrying a larger number of spacers, the classification was largely congruent with the literature. In addition, we could identify new signatures, especially one, termed SEA1, among previously unclassified patterns. We therefore believe that this algorithm is very useful for classifying the widely used 43spoligotype patterns in M. tuberculosis but could prove even more useful on patterns larger than 43 spacers, e.g. the improved 68 spacers format.
"Euroamerican" lineage evolution
Despite large sequencing efforts [25, 47], there has been a standing difficulty in unraveling the relationships inside "EuroAmerican" lineage strains (carrying the 3336 deletion in the spoligotype pattern), especially in the socalled "T family" described in SpolDB4 [23]. Here, using SpolDB4 database, we could challenge expertdefined families and subfamilies. We first confirmed the validity of S and T2 subfamilies that we suggest to consider as families. The S family was first described in Sicily [48] and independently identified in Québec where a sublineage was shown to harbor a peculiar pncA SNP [49]. The T2 family, defined by the absence of spacer 40 was originally described as M. africanum 2, however was shown later to be a bona fide M. tuberculosis subfamily [50]. We also confirmed the reliability of Haarlem family subclustering, if renaming H4 as Ural, and suggest considering H12 and H3 as two families. We confirmed the validity of T3 and T5 families as well as T4CEU (although T4 alone was invalidated). Some LAM subfamilies renamings based on VNTR and SNP loci [34–36] are given further support (LAM7 as TUR, LAM10 as Cameroon), while other were merged (LAM1, 2 and 5). The tendency to merge many expertdefined families was not pervasive. Indeed in the EAI family, four subfamilies out of the 5 expertdefined ones were confirmed.
Conclusion
This study describes 1) a novel distance method to be applied on genetic loci evolving by deletion, as for instance do inactive CRISPR loci, 2) a framework to take advantage of identified references for classifying individuals using such loci, 3) a way to identify new references using the Affinity Propagation algorithm [8], and 4) assignations and assignation tools for M. tuberculosis complex. The distance method and the framework to identify known references were largely validated by worldwide M. tuberculosis database at the CRISPR locus (spoligotype patterns). This work encourages the use of CRISPR patterns to cluster strains in other organisms carrying such loci and for which wide genotyping has been undertaken as it is now the case for human pathogens such as Yersinia pestis, Salmonella enterica, Bacilllus anthracis, Mycobacterium leprae and M. ulcerans. Affinity propagation could also be useful to cluster other genotyping data such as SNPs or minisatellites. Databases larger than those available by now are however required to test the validity of this method on such markers.
Methods
Spoligotyping data
SpolDB4[33], containing 1939 shared international spoligotypes (SIT) was used as raw data for spoligotyping diversity analysis. This database contains family or subfamily information, with some uncertainties indicated such as LAM3 and Sconvergent or T1T2. To simplify the analyses, when two assignations were provided, only the first one was kept. We also merged certain families when the number of spoligotype patterns was very small and not been confirmed by SNP typing [40]. Specifically, the families we retained are: africanum (n = 46), animal strains (grouping BOVIS, PINNIPEDII, MICROTII, CAPRAE, n_{tot} = 231), Beijing (n = 21), CAS (n = 86), EAI (n = 213), H (n = 233), LAM (n = 224), MANU (n = 39), T (n = 482) including S and H37Rv ST as suggested by Brudey et al (2006), X (n = 90). We excluded SIT69 which was suppressed by Institut Pasteur [33] as well as the canetti spoligotype pattern which is unique (SIT592). There are 272 unclassified spoligotypes (U). The references for each family correspond to SpolDB4 description [23]; they are listed in Table 1.
Methods to compute distances
These distance methods were used to compute the distance between each SpolDB4 spoligotype pattern and the references of the ten main M. tuberculosis complex families. The scripts for such calculations were written in R language [51] and are available upon request. Each pattern was assigned to the family whose reference it was the most similar to.
Clustering algorithms
and by taking the log function of it and summing it over all the nodes, so that the energy becomes infinite if at least an exemplar is represented by a different exemplar. The parameter β plays formally the role of the inverse of the temperature in a thermodynamical system, and thus determines the level of thermal noise acting on the system. This means that varying this parameter, but keeping it finite, allows the algorithm to accept configurations of the clusters that do not exactly corresponds to minima of the energy function. We consider here only the optimal case of zero temperature, i.e., β → ∞ (for a general and exhaustive treatment of the cavity method see, for example [52]).
These update rules represents messages that the nodes are exchanging between iteration t and t+1, with and representing respectively the energetic "competition" between node i and all the other nodes except j to be the exemplar of node j and the gain in the total energy of the system if node j is represented by node i. The notation i → j indicates that the message is sent from node i toward node j. When the update equations converge, then the value of each c_{ i } , i = 1,..., N, is obtained summing over all the messages arriving at node i and maximizing the sum. The diagonal elements of the similarity matrix, that are not constrained to be equal to the unity, play the role of an effective cost to every chosen exemplar, and thus a cost for the total number of clusters found. They are a finetuning for the selection of the total number of clusters found by AP. In our study we chose to consider every data point a priori equally probable to be an exemplar, so we set S(i,i) = p ∀ i = 1,..., N. Varying the parameter p from very large (negative) values up to positive values gives the range of total clusters from 1 to N, and we interpret a stability in the total number of clusters under changes of this parameter as a genetically meaningful grouping of the data, as discussed in the Results section. The similarity matrix S was obtained using the "Deletions" distance that had turned out to be the most accurate distance. Linear combinations of the various distances introduced in the previous section were also considered, but the overall result still favors the Deletions distance. We performed also a comparison of AP with other "classical" clustering algorithms, such as KMeans and Hierarchical clustering. We considered the performance with respect to the experts' classification as defined in SpolDB4 [23, 33] and identified that AP found clusters with much lower error (see Additional File 3). The script for computing the distance matrices of SpolDB4 database and performing the analysis with AP was written in Matlab as a selfcontained script, the bare AP algorithm for Matlab is available from the authors Frey and Dueck.
Declarations
Acknowledgements
Marc Mézard is acknowledged for fruitful discussions on the design of the study. Edgar Abadia, Jian Zhang and Michel Gomgnimbou are acknowledged for discussions on the design of the manuscript. CS and SF acknowledge financial support of Univ. ParisSud in the form of a « Chaire d'excellence » respectively in Physics (SF) and Microbiology (CS).
Authors’ Affiliations
References
 Le Flèche P, Fabre M, Denoeud F, Koeck J, Vergnaud G: High resolution, online identification of strains from the Mycobacterium tuberculosis complex based on tandem repeat typing. BMC Microbiol 2002, 2: 37. 10.1186/14712180237PubMed CentralView ArticlePubMedGoogle Scholar
 Pourcel C, Salvignol G, Vergnaud G: CRISPR elements in Yersinia pestis acquire new repeats by preferential uptake of bacteriophage DNA, and provide additional tools for evolutionary studies. Microbiology 2005, 151(Pt 3):653–663.View ArticlePubMedGoogle Scholar
 Zhang J, Abadia E, Refregier G, Tafaj S, Boschiroli ML, Guillard B, Andremont A, Ruimy R, Sola C: Mycobacterium tuberculosis complex CRISPR genotyping: improving efficiency, throughput and discriminative power of 'spoligotyping' with new spacers and a microbeadbased hybridization assay. J Med Microbiol 2009, 59((Pt 3)):285–94.PubMedGoogle Scholar
 Deshpande A, Gans J, Graves SW, Green L, Taylor L, Kim HB, Kunde YA, Leonard PM, Li PE, Mark J, et al.: A rapid multiplex assay for nucleic acidbased diagnostics. Journal of Microbiological Methods 2010, 80(2):155–163. 10.1016/j.mimet.2009.12.001View ArticlePubMedGoogle Scholar
 Pritchard JK, Stephens M, Donnelly P: Inference of population structure using multilocus genotype data. Genetics 2000, 155(2):945–959.PubMed CentralPubMedGoogle Scholar
 Wirth T, Hildebrand F, AllixBeguec C, Wolbeling F, Kubica T, Kremer K, van Soolingen D, RuschGerdes S, Locht C, Brisse S, et al.: Origin, spread and demography of the Mycobacterium tuberculosis complex. PLoS Pathog 2008, 4(9):e1000160. 10.1371/journal.ppat.1000160PubMed CentralView ArticlePubMedGoogle Scholar
 MacQueen J: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability 1967. Univ of California Press, Berkeley, CA; 1967.Google Scholar
 Frey BJ, Dueck D: Clustering by passing messages between data points. Science 2007, 315(5814):972–976. 10.1126/science.1136800View ArticlePubMedGoogle Scholar
 BaillyBechet M, Bradde S, Braunstein A, Flaxman A, Foini L, Zecchina R: Clustering with shallow trees. Journal of Statistical MechanicsTheory and Experiment 2009.Google Scholar
 Posada D, Crandall KA: MODELTEST: testing the model of DNA substitution. Bioinformatics 1998, 14: 817–818. 10.1093/bioinformatics/14.9.817View ArticlePubMedGoogle Scholar
 Ellegren H: Microsatellites: Simple sequences with complex evolution. Nature Reviews Genetics 2004, 5(6):435–445.View ArticlePubMedGoogle Scholar
 Jansen R, van Embden JD, Gaastra W, Schouls LM: Identification of a novel family of sequence repeats among prokaryotes. Genomics 2002, 6(1):23–33.Google Scholar
 Sorek R, Kunin V, Hugenholtz P: CRISPRa widespread system that provides acquired resistance against phages in bacteria and archaea. Nat Rev Microbiol 2008, 6(3):181–186. 10.1038/nrmicro1793View ArticlePubMedGoogle Scholar
 Barrangou R, Fremaux C, Deveau H, Richards M, Boyaval P, Moineau S, Romero DA, Horvath P: CRISPR provides acquired resistance against viruses in prokaryotes. Science 2007, 315(5819):1709–1712. 10.1126/science.1138140View ArticlePubMedGoogle Scholar
 Garneau JE, Dupuis ME, Villion M, Romero DA, Barrangou R, Boyaval P, Fremaux C, Horvath P, Magadan AH, Moineau S: The CRISPR/Cas bacterial immune system cleaves bacteriophage and plasmid DNA. Nature 2010, 468(7320):67+. 10.1038/nature09523View ArticlePubMedGoogle Scholar
 Liu F, Barrangou R, GernerSmidt P, Ribot EM, Knabel SJ, Dudley EG: Novel Virulence Gene and Clustered Regularly Interspaced Short Palindromic Repeat (CRISPR) Multilocus Sequence Typing Scheme for Subtyping of the Major Serovars of Salmonella enterica subsp enterica . Appl Environ Microbiol 2011, 77(6):1946–1956. 10.1128/AEM.0262510PubMed CentralView ArticlePubMedGoogle Scholar
 Mojica FJ, DiezVillasenor C, GarciaMartinez J, Soria E: Intervening sequences of regularly spaced prokaryotic repeats derive from foreign genetic elements. J Mol Evol 2005, 60(2):174–182. 10.1007/s0023900400463View ArticlePubMedGoogle Scholar
 Kamerbeek J, Schouls L, Kolk A, van Agterveld M, van Soolingen D, Kuijper S, Bunschoten A, Molhuizen H, Shaw R, Goyal M, et al.: Simultaneous detection and strain differentiation of Mycobacterium tuberculosis for diagnosis and epidemiology. J Clin Microbiol 1997, 35(4):907–914.PubMed CentralPubMedGoogle Scholar
 Kremer K, van Soolingen D, Frothingham R, Haas WH, Hermans PW, Martin C, Palittapongarnpim P, Plikaytis BB, Riley LW, Yakrus MA, et al.: Comparison of methods based on different molecular epidemiological markers for typing of Mycobacterium tuberculosis complex strains: interlaboratory study of discriminatory power and reproducibility. J Clin Microbiol 1999, 37(8):2607–2618.PubMed CentralPubMedGoogle Scholar
 Tafaj S, Zhang J, Hauck Y, Pourcel C, Hafizi H, Zoraqi G, Sola C: First insight into genetic diversity of the Mycobacterium tuberculosis complex in Albania obtained by multilocus variablenumber tandemrepeat analysis and spoligotyping reveals the presence of beijing multidrugresistant isolates. J Clin Microbiol 2009, 47(5):1581–1584. 10.1128/JCM.0228408PubMed CentralView ArticlePubMedGoogle Scholar
 Reyes JF, Francis AR, Tanaka MM: Models of deletion for visualizing bacterial variation: an application to tuberculosis spoligotypes. BMC Bioinformatics 2008, 9: 496. 10.1186/147121059496PubMed CentralView ArticlePubMedGoogle Scholar
 Filliol I, Driscoll JR, Van Soolingen D, Kreiswirth BN, Kremer K, Valétudie G, Anh DD, Barlow R, Banerjee D, Bifani PJ, et al.: Global distribution of Mycobacterium tuberculosis spoligotypes. Emerg Inf Dis 2002, 8(11):1347–1349.View ArticleGoogle Scholar
 Brudey K, Driscoll J, Rigouts L, Prodinger WM, Gori A, AlHajoj SAM, Allix C, Aristimuno L, Arora J, Baumanis V, et al.: Mycobacterium tuberculosis complex genetic diversity: mining the fourth international spoligotyping database (SpolDB4) for classification, Population Genetics, and Epidemiology. BMC Microbiol 2006, 6(6):23.PubMed CentralView ArticlePubMedGoogle Scholar
 Sola C, Devallois A, Horgen L, Maisetti J, Filliol I, Legrand E, Rastogi N: Tuberculosis in the Caribbean: using spacer oligonucleotide typing to understand strain origin and transmission. Emerg Infect Dis 1999, 5(3):404–414. 10.3201/eid0503.990311PubMed CentralView ArticlePubMedGoogle Scholar
 Comas I, Homolka S, Niemann S, Gagneux S: Genotyping of Genetically Monomorphic Bacteria: DNA Sequencing in Mycobacterium tuberculosis Highlights the Limitations of Current Methodologies. Plos One 2009, 4(11):e7815.PubMed CentralView ArticlePubMedGoogle Scholar
 Zhang J, Abadia E, Refregier G, Tafaj S, Boschiroli ML, Guillard B, Andremont A, Ruimy R, Sola C: Mycobacterium tuberculosis complex CRISPR genotyping: improving efficiency, throughput and discriminative power of 'spoligotyping' with new spacers and a microbeadbased hybridization assay. J Med Microbiol 2010, 59(Pt 3):285–294.View ArticlePubMedGoogle Scholar
 Gagneux S, DeRiemer K, Van T, KatoMaeda M, de Jong BC, Narayanan S, Nicol M, Niemann S, Kremer K, Gutierrez MC, et al.: Variable hostpathogen compatibility in Mycobacterium tuberculosis . Proc Natl Acad Sci USA 2006, 103(8):2869–2873. 10.1073/pnas.0511240103PubMed CentralView ArticlePubMedGoogle Scholar
 Sreevatsan S, Pan X, Stockbauer KE, Connell ND, Kreiswirth BN, Whittam TS, Musser JM: Restricted structural gene polymorphism in the Mycobacterium tuberculosis complex indicates evolutionarily recent global dissemination. Proc Natl Acad Sci USA 1997, 94(18):9869–9874. 10.1073/pnas.94.18.9869PubMed CentralView ArticlePubMedGoogle Scholar
 SpotClust[http://www.rpi.edu/~bennek/EpiResearch]
 Vitol I, Driscoll J, Kreiswirth B, Kurepina N, Bennett KP: Identifying Mycobacterium tuberculosis complex strain families using spoligotypes. Infect Genet Evol 2006, 6(6):491–504. 10.1016/j.meegid.2006.03.003View ArticlePubMedGoogle Scholar
 Aminian M, Shabbeer A, Bennett KP: A conformal Bayesian network for classification of Mycobacterium tuberculosis complex lineages. Bmc Bioinformatics 2010, 29(11):Suppl 3S4.Google Scholar
 Zozio T, Allix C, Gunal S, Saribas Z, Alp A, Durmaz R, FauvilleDufaux M, Rastogi N, Sola C: Genotyping of Mycobacterium tuberculosis clinical isolates in two cities of Turkey: description of a new family of genotypes that is phylogeographically specific for Asia Minor. BMC Microbiol 2005, 5: 44. 10.1186/14712180544PubMed CentralView ArticlePubMedGoogle Scholar
 SpolDB4[http://www.pasteurguadeloupe.fr:8081/SITVITDemo/]
 Abadia E, Zhang J, Vultos T, Ritacco V, Kremer K, Aktas E, Matsumoto T, Refregier G, Soolingen DV, Gicquel B, et al.: Resolving lineage assignation on Mycobacterium tuberculosis clinical isolates classified by spoligotyping with a new highthroughput 3R SNPs based method. Infect Genet Evol 2010, 10(7):1066–1074.View ArticlePubMedGoogle Scholar
 Millet J, MiyagiShiohira C, Yamane N, Sola C, Rastogi N: Assessment of mycobacterial interspersed repetitive unitQUB markers to further discriminate the Beijing genotype in a populationbased study of the genetic diversity of Mycobacterium tuberculosis clinical isolates from Okinawa, Ryukyu Islands, Japan. J Clin Microbiol 2007, 45(11):3606–3615. 10.1128/JCM.0034807PubMed CentralView ArticlePubMedGoogle Scholar
 Kovalev SY, Kamaev EY, Kravchenko MA, Kurepina NE, Skorniakov SN: Genetic analysis of Mycobacterium tuberculosis strains isolated in Ural region, Russian federation, by MIRUVNTR genotyping. Int J Tuberc Lung Dis 2005, 9(7):746–752.PubMedGoogle Scholar
 Donoghue HD, Lee OYC, Minnikin DE, Besra GS, Taylor JH, Spigelman M: Tuberculosis in Dr Granville's mummy: a molecular reexamination of the earliest known Egyptian mummy to be scientifically examined and given a medical diagnosis. Proceedings of the Royal Society BBiological Sciences 2010, 277(1678):51–56. 10.1098/rspb.2009.1484PubMed CentralView ArticleGoogle Scholar
 Gutierrez MC, Brisse S, Brosch R, Fabre M, Omais B, Marmiesse M, Supply P, Vincent V: Ancient origin and gene mosaicism of the progenitor of Mycobacterium tuberculosis . Plos Pathogens 2005, 1(1):e5.PubMed CentralView ArticlePubMedGoogle Scholar
 Brosch R, Gordon SV, Pym A, Eiglmeier K, Garnier T, Cole S: Comparative genomics of the mycobacteria. Int J Med Microbiol 2000, 290: 143–152.View ArticlePubMedGoogle Scholar
 Hershberg R, Lipatov M, Small PM, Sheffer H, Niemann S, Homolka S, Roach JC, Kremer K, Petrov DA, Feldman MW, et al.: High functional diversity in Mycobacterium tuberculosis driven by genetic drift and human demography. PLoS Biol 2008, 6(12):e311. 10.1371/journal.pbio.0060311PubMed CentralView ArticlePubMedGoogle Scholar
 van Soolingen D, de Haas PEW, Hermans PWM, Groenen PMA, van Embden JDA: Comparison of various repetitive DNA elements as genetic markers for strain differentiation and epidemiology of Mycobacterium tuberculosis . J Clin Microbiol 1993, 31: 1987–1995.PubMed CentralPubMedGoogle Scholar
 Eisenach KD, Crawford JT, Bates JH: Repetitive Sequences as Probes for Mycobacterium tuberculosis . J Clin MIcrobiol 1988, 26: 2240–2245.PubMed CentralPubMedGoogle Scholar
 van Soolingen D, Qian L, de Haas PE, Douglas JT, Traore H, Portaels F, Qing HZ, Enkhsaikan D, Nymadawa P, van Embden JD: Predominance of a single genotype of Mycobacterium tuberculosis in countries of east Asia. J Clin Microbiol 1995, 33(12):3234–3238.PubMed CentralPubMedGoogle Scholar
 Sebban M, Mokrousov I, Rastogi N, Sola C: A datamining approach to spacer oligonucleotide typing of Mycobacterium tuberculosis . Bioinformatics 2002, 18(2):235–243. 10.1093/bioinformatics/18.2.235View ArticlePubMedGoogle Scholar
 Dos Vultos T, Mestre O, Rauzier J, Golec M, Rastogi N, Rasolofo V, Tonjum T, Sola C, Matic I, Gicquel B: Evolution and Diversity of Clonal Bacteria: The Paradigm of Mycobacterium tuberculosis . PLoS ONE 2008, 3(2):e1538. 10.1371/journal.pone.0001538PubMed CentralView ArticlePubMedGoogle Scholar
 Warren RM, Streicher EM, Sampson SL, Van Der Spuy GD, Richardson M, Nguyen D, Behr MA, Victor TC, Van Helden PD: Microevolution of the Direct Repeat Region of Mycobacterium tuberculosis : Implications for Interpretation of Spoligotyping Data. J Clin Microbiol 2002, 40: 4457–4465. 10.1128/JCM.40.12.44574465.2002PubMed CentralView ArticlePubMedGoogle Scholar
 Nahid P, Bliven EE, Kim EY, Mac Kenzie WR, Stout JE, Diem L, Johnson JL, Gagneux S, Hopewell PC, KatoMaeda M: Influence of M. tuberculosis Lineage Variability within a Clinical Trial for Pulmonary Tuberculosis. Plos One 2010, 5(5):e10753.PubMed CentralView ArticlePubMedGoogle Scholar
 Sola C, Ferdinand S, Mammina C, Nastasi A, Rastogi N: Genetic diversity of Mycobacterium tuberculosis in Sicily based on spoligotyping and variable number of tandem DNA repeats and comparison with a spoligotyping database for populationbased analysis. J Clin Microbiol 2001, 39(4):1559–1565. 10.1128/JCM.39.4.15591565.2001PubMed CentralView ArticlePubMedGoogle Scholar
 Cheng SJ, Thibert L, Sanchez T, Heifets L, Zhang Y: pncA mutations as a major mechanism of pyrazinamide resistance in Mycobacterium tuberculosis : spread of a monoresistant strain in Quebec, Canada. Antimicrob Agents Chemother 2000, 44(3):528–532. 10.1128/AAC.44.3.528532.2000PubMed CentralView ArticlePubMedGoogle Scholar
 Niemann S, Kubica T, Bange FC, Adjei O, Browne EN, Chinbuah MA, Diel R, Gyapong J, Horstmann RD, Joloba ML, et al.: The Species Mycobacterium africanum in the Light of New Molecular Markers. J Clin Microbiol 2004, 42(9):3958–3962. 10.1128/JCM.42.9.39583962.2004PubMed CentralView ArticlePubMedGoogle Scholar
 R: A Language and Environment for Statistical Computing[http://cran.rproject.org/]
 Mezard M, Montanari A: Information, Physics, and Computation. Oxford University Press; 2009.View ArticleGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.