### Validation

To examine the robustness of the assumption that targets may be added in a stepwise fashion while maximising the parameter of interest (heuristic search), random datasets were generated and tested using both search types. These random datasets were defined by varying a) the number of targets, b) the number of different states each target could assume, c) the number of strain types and d) the number of isolates distributed (unevenly) amongst the strain types.

For each dataset, a heuristic search was used to calculate the threshold subset size. The heuristic search result for a subset one target smaller than the threshold was compared with the result of an exhaustive search specifying the same subset size. If the maximum parameter value obtained by the exhaustive search was the same as that obtained by the heuristic search, the heuristic was considered to be valid. If the maximum parameter value achieved by the heuristic search was lower than that achieved by the exhaustive search, the heuristic was considered not to have held. For each of the 5 parameters of interest, 25600 randomly generated datasets were examined. The heuristic was valid in 79.4% (95% confidence interval 79-80), 98.2% (98-99), 83.4% (83-84), 92.9% (92-93) and 93.6% (93-94) of random datasets for *D*, $AW_{(A>B)}$, $AW_{(B>A)}$, *R* and *AR*, respectively.
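The comparison described above can be illustrated with a minimal Python sketch. This is not the AuSeTTS code: the function names, the toy dataset and the choice to compare both searches at a fixed subset size (rather than at one target below the heuristic threshold) are assumptions made for illustration, and only *D* (taken here to be Simpson's index of diversity) is maximised.

```python
from collections import Counter
from itertools import combinations


def simpson_d(profiles):
    """Simpson's index of diversity for a list of hashable type profiles."""
    n = len(profiles)
    if n < 2:
        return 0.0
    counts = Counter(profiles).values()
    return 1.0 - sum(c * (c - 1) for c in counts) / (n * (n - 1))


def d_for_subset(data, subset):
    """D obtained when isolates are typed using only the targets in `subset`."""
    return simpson_d([tuple(row[t] for t in subset) for row in data])


def heuristic_search(data, size):
    """Stepwise search: add one target at a time, keeping the one that maximises D."""
    chosen = []
    remaining = list(range(len(data[0])))
    for _ in range(size):
        best = max(remaining, key=lambda t: d_for_subset(data, chosen + [t]))
        chosen.append(best)
        remaining.remove(best)
    return chosen, d_for_subset(data, chosen)


def exhaustive_search(data, size):
    """Evaluate every subset of the given size and return the best one."""
    targets = range(len(data[0]))
    best = max(combinations(targets, size), key=lambda s: d_for_subset(data, s))
    return list(best), d_for_subset(data, best)


# Hypothetical toy dataset: 6 isolates (rows) typed at 3 targets (columns).
# Target 0 is the best single target, but targets 1 and 2 together
# resolve all six isolates.
data = [
    (0, 0, 0),
    (0, 0, 1),
    (1, 0, 2),
    (1, 1, 2),
    (2, 1, 0),
    (3, 1, 1),
]

heuristic_subset, heuristic_d = heuristic_search(data, 2)
exhaustive_subset, exhaustive_d = exhaustive_search(data, 2)
heuristic_held = heuristic_d >= exhaustive_d
print(heuristic_subset, round(heuristic_d, 3))
print(exhaustive_subset, round(exhaustive_d, 3))
print("heuristic held:", heuristic_held)
```

On this toy dataset the stepwise search commits to target 0 first and therefore cannot reach the pair of targets found by the exhaustive search, which is the kind of case counted as the heuristic not having held. The exhaustive search evaluates every combination of the requested size, so its cost grows combinatorially with the number of targets.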

Factors associated with failure of the heuristic to identify the combination of targets that maximised *D* included a value of *D* between 0.90 and 0.96 and a larger number of targets analysed. The heuristic performed best when the maximum *D* of the whole dataset was 1 (87.8%, 95% CI 87-89). The number of strain types, the number of isolates in the dataset and the number of states each target could assume did not influence the likelihood of the heuristic being valid.

The heuristic performed well for all four concordance measures. Factors associated with a lower likelihood of the heuristic being valid for concordance measures included an increasing number of targets in the dataset, a *D* value of the dataset between 0.9 and 0.96, examination of a subset of close to half of the total number of targets and, for $AW_{(A>B)}$, a maximum *AW* value between 0.1 and 0.35.

Full details of the validation are available in the supplementary material (Additional file 2).

### Application

The software was used to analyse different forms of microbial typing data generated by well-validated methods, specifically, binary typing data for *Staphylococcus aureus* [16–18], MLVA for *Streptococcus pneumoniae* [19] and MLST for *Cryptococcus* spp. [20, 21].

### Selection of targets for *Staphylococcus aureus* strain typing

Typing results for 51 binary targets in 153 methicillin-resistant *S. aureus* (MRSA) isolates (42 well characterised reference isolates and 111 clinical isolates from our institution) were available from previous experiments in our laboratory [16–18]. The targets comprised: 13 toxin genes [17], 16 phage-derived open reading frames [18] and 22 SCC*mec* elements [16] which had been interrogated using multiplex-PCR reverse line blot assays [22, 23].

The maximum *D* value of binary typing with all 51 targets for this collection of MRSA isolates was 0.984 (95% confidence interval 0.975-0.992). AuSeTTS heuristic search showed that this could be achieved with a subset of 20 binary targets, while a subset of just 7 targets achieved a *D* value of 0.954 (0.941-0.967) (Figure 2A). When used to predict MLST (which had been determined by either the conventional [24] or SNP-based [25] methods for all 153 isolates), a maximum Adjusted Wallace coefficient of concordance (*AW*) of 0.9994 (0.999-1.000) was achieved with 12 targets (Figure 2B). One binary type consisted of two isolates with different MLST (which were single-locus variants). Isolates within each of the remaining binary types all belonged to one MLST type.
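As an illustration of the concordance measure reported above, the following Python sketch computes the Wallace coefficient and its chance-adjusted form for two partitions of the same isolates. It assumes the commonly used definition in which the agreement expected by chance for A→B is one minus Simpson's index of diversity of method B; the function names and the example labels are hypothetical and are not taken from AuSeTTS or from the MRSA dataset.

```python
from collections import Counter
from itertools import combinations


def simpson_d(labels):
    """Simpson's index of diversity of a single partition."""
    n = len(labels)
    counts = Counter(labels).values()
    return 1.0 - sum(c * (c - 1) for c in counts) / (n * (n - 1))


def wallace(a, b):
    """W(A->B): of the isolate pairs sharing an A type, the fraction that also share a B type."""
    same_a = same_both = 0
    for i, j in combinations(range(len(a)), 2):
        if a[i] == a[j]:
            same_a += 1
            if b[i] == b[j]:
                same_both += 1
    if same_a == 0:
        raise ValueError("method A places no two isolates in the same type")
    return same_both / same_a


def adjusted_wallace(a, b):
    """AW(A->B) = (W - W_chance) / (1 - W_chance), with W_chance = 1 - D of method B."""
    w_chance = 1.0 - simpson_d(b)
    if w_chance == 1.0:
        raise ValueError("method B assigns every isolate to one type")
    return (wallace(a, b) - w_chance) / (1.0 - w_chance)


# Hypothetical labels for six isolates: binary type (A) and sequence type (B).
binary_types = ["x", "x", "y", "y", "z", "z"]
sequence_types = ["s1", "s1", "s2", "s2", "s2", "s3"]

w = wallace(binary_types, sequence_types)
aw = adjusted_wallace(binary_types, sequence_types)
print(round(w, 3), round(aw, 3))
```

In this example two of the three binary types are pure with respect to sequence type, so the unadjusted coefficient is 2/3; the adjustment lowers it because some agreement would be expected by chance given the distribution of sequence types.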

These data were used to develop a novel 19-target binary typing system for MRSA [26].

### Selection of targets for *Streptococcus pneumoniae* strain typing

Results of MLVA typing, using 17 loci, for 1449 *Streptococcus pneumoniae* isolates (representing 906 possible MLVA types) were available from the MLVA online database (http://www.mlva.eu) [19] for analysis by AuSeTTS. A maximum *D* of 0.997 (0.997-0.998) was achieved with all 17 loci but only 4 targets were required to achieve a *D* value of 0.990 (0.988-0.991), which divided the isolates into 438 MLVA types.
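For readers who want to reproduce figures of this kind, the sketch below computes *D* as Simpson's index of diversity together with an approximate 95% confidence interval. The text does not state which interval method AuSeTTS uses; the large-sample variance approximation shown here (variance = 4/n × [Σπ³ − (Σπ²)²], interval = *D* ± 2 standard deviations) is an assumption, and the function name and example data are hypothetical.

```python
from collections import Counter
from math import sqrt


def simpson_d_with_ci(types):
    """Return (D, lower, upper) for a list of type labels, one label per isolate."""
    n = len(types)
    counts = list(Counter(types).values())
    d = 1.0 - sum(c * (c - 1) for c in counts) / (n * (n - 1))
    # Type frequencies used by the large-sample variance approximation.
    freqs = [c / n for c in counts]
    variance = (4.0 / n) * (sum(p ** 3 for p in freqs) - sum(p ** 2 for p in freqs) ** 2)
    # Guard against tiny negative values caused by floating-point rounding.
    half_width = 2.0 * sqrt(max(variance, 0.0))
    return d, max(d - half_width, 0.0), min(d + half_width, 1.0)


# Hypothetical example: 10 isolates falling into three types.
example_types = ["a"] * 5 + ["b"] * 3 + ["c"] * 2
d, lower, upper = simpson_d_with_ci(example_types)
print(round(d, 3), round(lower, 3), round(upper, 3))
```

The interval is clipped to the range 0 to 1, which matters for small datasets where the approximation is least reliable.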

A subset of the isolates for which MLVA results were available had also been serotyped (537 isolates representing 43 serotypes and 398 MLVA types), and these were used to determine the combination of MLVA loci that could best predict the serotype. A maximum *AW* of 0.899 (0.857-0.942) for serotype was achieved using 12 of the MLVA loci. This particular combination of 12 targets divided the dataset into 370 MLVA types, 352 of which contained only one serotype, while 15 contained two, two contained one and one MLVA type represented by 6 isolates harboured 5 different serotypes.

A similar analysis was performed with MLST data, which were available for 96 of the isolates, comprising 27 sequence types (ST) and 77 possible MLVA types. A maximum *AW* of 0.963 (0.943-0.983) for MLVA to predict ST was achieved with 9 targets, which divided the 96 isolates into 60 MLVA types. One MLVA type consisted of 3 isolates with 3 different MLST types. All other MLVA types consisted of isolates with matching MLST types.

### Selection of targets for *Cryptococcus* species strain typing

Twelve MLST loci for 98 *Cryptococcus* spp. isolates from a previously published study [21] were examined using AuSeTTS. In a heuristic search, eight of the 12 MLST loci provided a maximum *D* of 0.963 (0.945-0.981) for *Cryptococcus* spp. The exhaustive search, specifying a subset size of seven loci, indicated that the same maximal *D* value could be achieved with only seven loci; i.e. for this dataset, the heuristic was invalid but the most informative combination of targets could still be identified using an exhaustive search. This analysis was used, in part, to determine the recommended targets for an international consensus protocol for MLST typing of *Cryptococcus* spp. [27].

### Discussion

AuSeTTS has been successfully applied to develop typing schemes for MRSA [26] and *Cryptococcus* spp. [27] and would be useful to assess the discriminatory power of combinations of candidate targets for typing systems for other pathogens. It can be used for a wide range of data types, but for interrogation of informative SNPs, we recommend Minimum SNPs, which has been designed specifically for this purpose [6, 7]. Minimum SNPs should be used to examine input data in the form of multiple sequence alignments. AuSeTTS can also be used to examine the level of concordance between results produced using subsets of candidate targets and those of existing phenotyping or genotyping methods or with epidemiologic classifications. Minimum SNPs does provide some functionality with regard to concordance measures (the “not-N” mode), but does not calculate the Wallace or Rand coefficients or confidence intervals for the adjusted Wallace coefficient.

While the algorithm used in the heuristic search may not always provide a definitive result for the minimum subset size required for the maximal *D* value, it will be correct in the majority of cases. For smaller datasets, an exhaustive search can easily be undertaken to confirm the validity of the heuristic. This is particularly recommended if the dataset has several features that were associated with a higher likelihood of the heuristic being invalid, such as low maximum *D* values, a threshold value close to 50% of the total number of targets, fewer than 8 states that each target can assume and a large number of unique strain types. A worked example demonstrating the use of AuSeTTS (Additional file 3) using a sample dataset (Additional file 4) accompanies this paper.